Abstract

We establish oracle inequalities for a version of the Lasso in high-dimensional fixed effects 
dynamic panel data models. The inequalities are valid for the coefficients of the dynamic 
and exogenous regressors. Separate oracle inequalities are derived for the fixed effects. Next, 
we show how one can conduct uniformly valid simultaneous inference on the parameters of 
the model and construct a uniformly valid estimator of the asymptotic covariance matrix 
which is robust to conditional heteroskedasticity in the error terms. Allowing for conditional 
heteroskedasticity is important in dynamic models as the conditional error variance may be 
non-constant over time and depend on the covariates. Furthermore, our procedure allows for 
inference on high-dimensional subsets of the parameter vector of an increasing cardinality. 
We show that the confidence bands resulting from our procedure are asymptotically honest 
and contract at the optimal rate. This rate is different for the fixed effects than for the 
remaining parts of the parameter vector. 


1 Introduction 


Dynamic panel data models are widely used in economics and the social sciences. They are
extremely popular as workers, firms, and countries often differ due to unobserved factors.
Furthermore, in many modern applications these units are sampled repeatedly over time, which
allows one to model their dynamic development. However, so far no work has been done
on how to conduct inference in the high-dimensional dynamic fixed effects model

$$y_{i,t} = \sum_{l=1}^{L} \alpha_l y_{i,t-l} + x_{i,t}'\beta + \eta_i + \varepsilon_{i,t}, \quad i = 1,\ldots,N, \ t = 1,\ldots,T, \qquad (1.1)$$


where the presence of $L$ lags of $y_{i,t}$ allows for autoregressive dependence of $y_{i,t}$ on its own past.
$x_{i,t}$ is a $p_x \times 1$ vector of exogenous variables and $\eta_i$, $i = 1,\ldots,N$, are the $N$ individual specific
fixed effects while $\varepsilon_{i,t}$ are idiosyncratic error terms. Applications of panel data are widespread:
ranging from wage regressions where one seeks to explain a worker's salary, to models of economic
growth determining the factors that impact growth over time of a panel of countries as in Islam
(1995).
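To make the data generating process in (1.1) concrete, the following is a minimal simulation sketch. The function name `simulate_dynamic_panel`, the Gaussian draws for $x_{i,t}$ and $\varepsilon_{i,t}$, the zero starting values and the burn-in length are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def simulate_dynamic_panel(N, T, alpha, beta, eta, sigma=1.0, burn_in=50, seed=0):
    """Simulate y_{i,t} = sum_l alpha_l y_{i,t-l} + x_{i,t}'beta + eta_i + eps_{i,t}.

    Returns y of shape (N, L + T) (L initial values followed by T periods)
    and x of shape (N, L + T, p_x). Gaussian x and eps are an assumption.
    """
    rng = np.random.default_rng(seed)
    alpha, beta, eta = (np.asarray(v, dtype=float) for v in (alpha, beta, eta))
    L, p_x = alpha.size, beta.size
    total = burn_in + L + T                     # burn-in, L initial values, T periods
    x = rng.standard_normal((N, total, p_x))
    eps = sigma * rng.standard_normal((N, total))
    y = np.zeros((N, total))                    # zero starting values (assumption)
    for t in range(L, total):
        lags = y[:, t - L:t][:, ::-1]           # columns: y_{t-1}, ..., y_{t-L}
        y[:, t] = lags @ alpha + x[:, t] @ beta + eta + eps[:, t]
    keep = slice(burn_in, total)                # drop the burn-in periods
    return y[:, keep], x[:, keep]
```

The first $L$ columns of the returned `y` play the role of the initial observations, so that $T$ full periods remain for estimation.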


Recent years have witnessed a surge in the availability of big data sets including many
explanatory variables. For example, De Neve et al. (2012) have considered the effect of genes on
happiness/life satisfaction. Controlling for many genes simultaneously clearly results in a vast
set of explanatory variables, hence calling for techniques which can handle such a setting.
High-dimensionality may also arise out of a desire to control for flexible functional forms by including
various transformations, such as cross products, of the available explanatory variables. In the
specific context of panel data models Andersen et al. (2013) investigated the causal effect of
lightning density on economic growth using a US panel data set. These authors had access to
a big set of control variables compared to the sample size. For this reason, they decided to
investigate the effect of lightning using several subsets of control variables instead of including
all control variables simultaneously as one would ideally do. In this paper we show how one
can achieve this ideal by proposing an inferential procedure for high-dimensional dynamic panel
data models.

Much progress has also been made on the methodological side in the last decade. Among
the most popular procedures is the Lasso of Tibshirani (1996) which sparked a lot of research
on its properties. However, until recently, not much work had been done on inference in
high-dimensional models for Lasso-type estimators as these possess a rather complicated distribution
even in the low dimensional case, see Knight and Fu (2000). This problem has been cleverly
approached by unpenalized estimation after double selection by Belloni et al. (2012, 2014) or by
desparsification in Zhang and Zhang (2014); van de Geer et al. (2014); Javanmard and Montanari
(2013); Caner and Kock (2014).

The focus in the above mentioned work has been almost exclusively on independent data
and often on the plain linear regression model while high-dimensional panel data has not been
treated. Exceptions are Kock (2013) and Belloni et al. (2014) who have established oracle
inequalities and asymptotically valid inference for a low-dimensional parameter in static panel
data models, respectively. Caner and Zhang (2014) have studied the properties of penalized
GMM, which can be used to estimate dynamic panel data models, in the case of fewer
parameters than observations. To the best of our knowledge, no research has been conducted
on inference in high-dimensional dynamic panel data models. Note that high-dimensionality
may arise from three sources in the dynamic panel data model (1.1). These sources are the
coefficients pertaining to the lagged left hand side variables ($\alpha_l$), the exogenous variables ($\beta$),
as well as the fixed effects ($\eta_i$). In particular, we shall see that (joint) inference involving an
$\eta_i$ behaves markedly differently from inference only involving the $\alpha_l$'s and $\beta$. Furthermore, panel
data differ from the classic linear regression model in that one does not have independence
across $t = 1,\ldots,T$ for any $i$ as consecutive observations in time can be highly correlated for
any given individual. Ignoring this dependence may lead to gravely misleading inference even
in low-dimensional panel data models. For that reason we shall make no assumptions on this
dependence structure across $t = 1,\ldots,T$ for the $x_{i,t}$. Static panel data models are a special case
of (1.1) corresponding to $\alpha_l = 0$, $l = 1,\ldots,L$.

Traditional approaches to inference in low-dimensional static panel data models have
considered the $N$ fixed effects $\eta_i$ as nuisance parameters which have been removed by taking either
first differences or demeaning the data over time for each individual $i$, see e.g., Wooldridge
(2010), Arellano (2003), Baltagi (2008). In this paper we take the stand that the fixed effects
may be of intrinsic interest. Thus we do not remove them by first differencing or demeaning.
This allows us to test hypotheses simultaneously involving $\alpha$, $\beta$ and $\eta$.

The two most common assumptions on the unobserved heterogeneities, $\eta_i$, are the random
and fixed effects frameworks. In the former, the $\eta_i$ are required to be uncorrelated with the
remaining explanatory variables while the latter does not impose any restrictions. Ruling out
any correlation between the $\eta_i$ and the other covariates is often unreasonable. In this paper we
strike a middle ground between the random and fixed effects setting: we do not require zero
correlation between the unobserved heterogeneities and the other covariates, however we shall
impose that $(\eta_1, \ldots, \eta_N)$ is weakly sparse in a sense to be made precise in Section 2.2. We still
refer to the $\eta_i$ as fixed effects as we treat them as parameters to be estimated as is common
in fixed effects settings. However, the reader should keep in mind that our setting is actually
intermediate between the random and fixed effects setting.


In an interesting recent paper dealing with the low-dimensional case, Bonhomme and Manresa
(2014) have assumed a different type of structure, namely grouping, on the fixed effects.
However, in the high-dimensional setting we are considering, weak sparsity works well as just
explained.

Our inferential procedure is closest in spirit to the one in van de Geer et al. (2014), which
in turn builds on Zhang and Zhang (2014), who cleverly used nodewise regressions to desparsify
the Lasso and to construct an approximate inverse of the non-invertible sample Gram matrix in
the context of the linear regression model. In particular, we show how nodewise regressions can
be used to construct one of the blocks of the approximate inverse of the empirical Gram matrix
in dynamic panel data models. As opposed to van de Geer et al. (2014), we do not require the
inverse covariance matrix of the covariates to be exactly sparse. It suffices that the rows of the
inverse covariance matrix are weakly sparse. Thus, none of its entries needs to be zero.

We contribute by first establishing an oracle inequality for a version of the Lasso in dynamic
panel data models for all groups of parameters. As can be expected, the fixed effects turn out
to behave differently than the remaining parameters. Next, we show how joint asymptotically
gaussian inference may be conducted on the three types of parameters in (1.1). In
particular, we show that hypotheses involving an increasing number of parameters can be tested
and provide a uniformly consistent estimator of the asymptotic covariance matrix which is
robust to conditional heteroskedasticity. Thus, we introduce a feasible procedure for inference
in high-dimensional heteroskedastic dynamic panel data models. Allowing for conditional
heteroskedasticity is important in dynamic models like the one considered here as the conditional
variance is known to often depend on the current state of the process, see e.g. Engle (1982).
Thus, assuming the error terms to be independent of the covariates with a constant variance is
not reasonable. Next, we show that confidence bands constructed by our procedure are
asymptotically honest (uniform) in the sense of Li (1989) over a certain subset of the parameter space.
Finally, we show that the confidence bands have uniformly the optimal rate of contraction for
all types of parameters. Thus, the honesty is not bought at the price of wide confidence bands
as is the case for sparse estimators, cf. Pötscher (2009). Simulations reveal that our procedure
performs well in terms of size, power, and coverage rate of the constructed intervals.

The rest of the paper is organized as follows. Section 2 introduces the estimator and provides
an oracle inequality for all types of parameters. Next, Section 3 shows how limiting gaussian
inference may be conducted and provides a feasible estimator of the covariance matrix which
is robust to heteroskedasticity even in the case where the number of parameter estimates we
seek the limiting distribution for diverges with the sample size. Section 4 shows that confidence
intervals constructed by our procedure are honest and contract at the optimal rate for all types
of parameters. Section 5 studies our estimator in Monte Carlo experiments while Section 6
concludes. All the proofs of our results are deferred to Appendix A; Appendix B contains
further auxiliary lemmas needed in Appendix A.


2 The Model 

2.1 Notation 

For $x \in \mathbb{R}^n$, let $\|x\|_{\ell_0} = \sum_{i=1}^n 1(x_i \neq 0)$, $\|x\| = \sqrt{\sum_{i=1}^n x_i^2}$, $\|x\|_{\ell_1} = \sum_{i=1}^n |x_i|$ and $\|x\|_{\ell_\infty} =
\max_{1 \le i \le n} |x_i|$ denote the $\ell_0$, $\ell_2$, $\ell_1$ and $\ell_\infty$ norms, respectively. Let $e_m$ denote the unit column
vector with $m$th entry being 1 in some Euclidean space whose dimension depends on the context.
If the argument of $\|\cdot\|_{\ell_\infty}$ is a matrix, then $\|\cdot\|_{\ell_\infty}$ denotes the maximal absolute element of the
matrix. For some generic set $R \subseteq \{1,\ldots,n\}$, let $x_R \in \mathbb{R}^{|R|}$ denote the vector obtained by
extracting the elements of $x \in \mathbb{R}^n$ whose indices are in $R$, where $|R|$ denotes the cardinality of
$R$ and $R^c = \{1,\ldots,n\} \setminus R$. For an $n \times n$ matrix $A$, $A_R$ denotes the submatrix consisting of the rows
and columns indexed by $R$. $\otimes$ is the Kronecker product. Let $a \vee b$ and $a \wedge b$ denote $\max(a, b)$
and $\min(a, b)$, respectively. For two real sequences $(a_n)$ and $(b_n)$, $a_n \lesssim b_n$ means that $a_n \le C b_n$
for some fixed, finite and positive constant $C$ for all $n \ge 1$. For two deterministic sequences $a_n$
and $b_n$ we write $a_n \asymp b_n$ if there exist constants $0 < a_1 \le a_2$ such that $a_1 b_n \le a_n \le a_2 b_n$ for
all $n \ge 1$. $\mathrm{sgn}(\cdot)$ is the sign function, $\mathrm{maxeval}(\cdot)$ and $\mathrm{mineval}(\cdot)$ are the maximal and minimal
eigenvalues of the argument, respectively. For some vector $x \in \mathbb{R}^n$, $\mathrm{diag}(x)$ gives an $n \times n$ diagonal
matrix with $x$ supplying the diagonal entries.

The model in (1.1) can be rewritten as

$$y_{i,t} = z_{i,t}'\alpha + \eta_i + \varepsilon_{i,t}, \quad i = 1,\ldots,N, \ t = 1,\ldots,T, \qquad (2.1)$$

where $z_{i,t} := (y_{i,t-1}, \ldots, y_{i,t-L}, x_{i,t}')'$ and $\alpha := (\alpha_1, \ldots, \alpha_L, \beta')'$ are $p \times 1$ vectors ($p = p_x + L$).

Note that the dimensions $L$, $p_x$ and $p$ can vary with the dimensions $N$ and $T$ but in general
we suppress this dependence where no confusion arises. We assume that initial observations
$y_{i,0}, y_{i,-1}, \ldots, y_{i,1-L}$ are available for $i = 1,\ldots,N$.

The three sources of high-dimensionality in (2.1) are $p_x$, $L$ and $N$ as all of these can be
increasing sequences. Sometimes one thinks of the number of lags, $L$, as being fixed and in that
case only two sources remain. Next, (2.1) may be written more compactly as

$$y_i = Z_i\alpha + \eta_i\iota + \varepsilon_i,$$

where $Z_i := (z_{i,1}, \ldots, z_{i,T})'$ is a $T \times p$ matrix, $y_i := (y_{i,1}, \ldots, y_{i,T})'$, $\varepsilon_i := (\varepsilon_{i,1}, \ldots, \varepsilon_{i,T})'$ and $\iota$ is
a $T \times 1$ vector of ones. Then, one can write

$$y = Z\alpha + D\eta + \varepsilon = \Pi\gamma + \varepsilon,$$

where $Z := (Z_1', \ldots, Z_N')'$, $y := (y_1', \ldots, y_N')'$ and $\varepsilon := (\varepsilon_1', \ldots, \varepsilon_N')'$. $\eta := (\eta_1, \ldots, \eta_N)'$ contains
the fixed effects, $D := I_N \otimes \iota$, and $\Pi := (Z, D)$. Finally, $\gamma := (\alpha', \eta')'$ contains all $p + N$
parameters of the model. Thus the dynamic panel model (1.1) can be written more compactly
as something resembling a linear regression model. There are several differences, however. First,
blocks of rows in the data matrix $\Pi$ may be heavily dependent. Second, we shall see that $\alpha$ and
$\eta$ have markedly different properties as a result of the fact that the probabilistic properties of
the blocks of a properly scaled version of the Gram matrix pertaining to $\Pi$ are very different.
Third, imposing weak sparsity only on $\eta$ implies that the oracle inequalities which we use as a
stepping stone towards inference do not follow directly from the technique in, e.g., Bickel et al.
(2009). In fact, we do not get explicit expressions for the upper bounds but instead characterize
them as solutions to certain quadratic equations in two variables.
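The stacking into $y = \Pi\gamma + \varepsilon$ can be sketched as follows. This is only an illustration under the assumption of a balanced panel stored as NumPy arrays; the function name `stack_panel` and the array layout (the first $L$ columns of `y` holding the initial observations) are hypothetical choices, not part of the paper.

```python
import numpy as np

def stack_panel(y, x, L):
    """Build the stacked regression y_vec = Pi @ gamma + eps of Section 2.1.

    y: array (N, L + T) whose first L columns are the initial observations.
    x: array (N, L + T, p_x) of exogenous variables.
    Returns y_vec (NT,), Z (NT, p), D (NT, N) and Pi = (Z, D).
    """
    N, total = y.shape
    T = total - L
    rows_y, rows_z = [], []
    for i in range(N):                                   # individuals are stacked in blocks
        for t in range(L, total):
            lags = y[i, t - L:t][::-1]                   # y_{i,t-1}, ..., y_{i,t-L}
            rows_z.append(np.concatenate([lags, x[i, t]]))   # z_{i,t}
            rows_y.append(y[i, t])
    Z = np.asarray(rows_z)                               # NT x p with p = L + p_x
    D = np.kron(np.eye(N), np.ones((T, 1)))              # I_N (Kronecker) iota
    Pi = np.hstack([Z, D])                               # NT x (p + N)
    return np.asarray(rows_y), Z, D, Pi
```

With this layout `D @ eta` simply repeats each $\eta_i$ over the $T$ rows belonging to individual $i$.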

2.2 Weak Sparsity and the Panel Lasso 

Let $J_1 = \{j : \alpha_j \neq 0, \ j = 1,\ldots,p\}$ denote the active set of lagged left hand side variables and
$x_{i,t}$ with $1 \le s_1 = |J_1| \le p$. $\alpha$ is said to be (exactly or $\ell_0$) sparse when $s_1$ is small compared
to $p$. Exact sparsity is by now a standard assumption in high-dimensional econometrics and
statistics. The unobserved heterogeneity, $\eta$, is usually modeled as either random or fixed effects.
The former rules out correlation between $\eta$ and the remaining covariates. This is often too
restrictive. In the fixed effects approach no restrictions are imposed on correlation between $\eta$
and the covariates. As explained in the introduction, our fixed effects approach is in fact a
middle ground between pure random and fixed effects approaches. We choose to call it a fixed
effects approach as $\eta$ is treated as a parameter to be estimated. However, $\eta$ is not entirely
unrestricted and assumed to be weakly sparse^1 in the sense

$$\sum_{i=1}^{N} |\eta_i|^{\nu} \le E_N$$

for some $0 < \nu < 1$ and $E_N = E > 0$. Weak sparsity does not require any of the fixed effects to
be zero but instead restricts the "sum", $E$, of all the fixed effects. $E$ can be large in the sense
that it tends to infinity but the smaller it is, the sharper our results will be. It is appropriate
to stress that the fixed effects cannot be entirely unrestricted; that is why our setting is a
middle ground between random and fixed effects. Thus, our framework also excludes many
models of interest. We believe, however, that our results provide a useful first step towards
uniform inference in high-dimensional dynamic panel data models and we certainly allow for
more correlation between $\eta$ and the covariates than the random effects assumption does.

^1 The term weakly sparse is borrowed from Negahban et al. (2012).

Note that the presence of many control variables in a high-dimensional model leaves less
variation to be explained by the unobserved heterogeneities and these are therefore likely to
be small in magnitude making the weak sparsity assumption reasonable. Thus, weak sparsity
actually becomes more reasonable the larger the number of control variables is.

Weak sparsity is a strict generalization of exact sparsity in the sense that if only $s_2$ elements
of $\eta$ are non-zero and none of these exceeds a constant $K$ in absolute value, then
$\sum_{i=1}^{N} |\eta_i|^{\nu} \le s_2 K^{\nu}$. Thus,
$E = s_2 K^{\nu}$ works. Alternatively, exact sparsity of $\eta$ can be handled as the boundary case $\nu = 0$
upon defining $0^0 = 0$ such that $E$ will equal the number of non-zero entries of $\eta$.
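As a small numerical illustration of the relation between exact and weak sparsity, the sketch below evaluates $\sum_i |\eta_i|^{\nu}$ for a made-up vector of fixed effects and compares it with the bound $s_2 K^{\nu}$. The helper name `weak_sparsity_sum` and the numbers are purely illustrative assumptions.

```python
import numpy as np

def weak_sparsity_sum(eta, nu):
    """Return sum_i |eta_i|^nu, using the convention 0^0 = 0 when nu == 0."""
    a = np.abs(np.asarray(eta, dtype=float))
    if nu == 0:
        return float(np.count_nonzero(a))       # boundary case: number of non-zeros
    return float(np.sum(a ** nu))

eta = np.array([0.0, 0.5, -2.0, 0.0, 1.5])      # hypothetical fixed effects
nu = 0.5
s2 = np.count_nonzero(eta)                      # number of non-zero entries
K = np.max(np.abs(eta))                         # bound on the non-zero entries
bound_holds = weak_sparsity_sum(eta, nu) <= s2 * K ** nu
count_case = weak_sparsity_sum(eta, 0)
```

Here `bound_holds` checks the inequality $\sum_i |\eta_i|^{\nu} \le s_2 K^{\nu}$ and `count_case` recovers the sparsity index in the boundary case $\nu = 0$.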


2.3 The Objective Function and Assumptions 

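Our starting point for inference is the minimiser $\hat\gamma = (\hat\alpha', \hat\eta')'$ of the following panel Lasso
objective function

$$L(\gamma) = \|y - \Pi\gamma\|^2 + 2\lambda_N\|\alpha\|_{\ell_1} + 2\frac{\lambda_N}{\sqrt{N}}\|\eta\|_{\ell_1}. \qquad (2.2)$$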


As usual $\lambda_N$ is a positive regularization sequence. Note that we penalize $\alpha$ and $\eta$ differently
to reflect the fact that we have $NT$ observations to estimate $\alpha_j$ for $j = 1,\ldots,p$ while only
$T$ observations are available to estimate each $\eta_i$. Penalizing the fixed effects is not new and
was already done in Koenker (2004) and Galvao and Montes-Rojas (2010) in a low dimensional
panel-quantile model. Furthermore, the penalization fits well with the weak sparsity assumption
on the fixed effects and may increase efficiency of $\hat\alpha$ as found in Galvao and Montes-Rojas (2010).

For practical implementation it is very convenient that we only have one penalty parameter
$\lambda_N$ instead of having separate penalty parameters for $\alpha$ and $\eta$. The minimization problem can
be solved easily as it simply corresponds to a weighted Lasso with known weights. However,
the probabilistic analysis of the properly scaled Gram matrix is different from the one for the
standard Lasso as it must be broken into several steps. We now turn to the assumptions needed
for our inferential procedure.
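Since (2.2) is a weighted Lasso with known weights, any standard Lasso solver can be adapted to it. The sketch below uses plain cyclic coordinate descent, which is one possible choice and not necessarily what the authors used; the function names and the stopping rule are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(a, w):
    """Scalar soft-thresholding operator sign(a) * max(|a| - w, 0)."""
    return np.sign(a) * max(abs(a) - w, 0.0)

def panel_lasso(y, Z, D, lam, n_iter=500, tol=1e-10):
    """Minimise ||y - Pi g||^2 + 2*lam*||alpha||_1 + 2*(lam/sqrt(N))*||eta||_1
    over g = (alpha', eta')' with Pi = (Z, D), by cyclic coordinate descent."""
    p, N = Z.shape[1], D.shape[1]
    Pi = np.hstack([Z, D])
    # penalty weight lam for alpha_j and lam / sqrt(N) for eta_i, as in (2.2)
    weights = np.concatenate([np.full(p, lam), np.full(N, lam / np.sqrt(N))])
    col_sq = np.sum(Pi ** 2, axis=0)
    g = np.zeros(p + N)
    resid = np.asarray(y, dtype=float).copy()        # y - Pi g at g = 0
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p + N):
            if col_sq[j] == 0.0:
                continue
            rho = Pi[:, j] @ resid + col_sq[j] * g[j]    # pi_j' (partial residual)
            new = soft_threshold(rho, weights[j]) / col_sq[j]
            if new != g[j]:
                resid -= Pi[:, j] * (new - g[j])         # keep residual in sync
                max_change = max(max_change, abs(new - g[j]))
                g[j] = new
        if max_change < tol:                             # no coordinate moved
            break
    return g[:p], g[p:]
```

At convergence the returned pair satisfies the first order conditions of (2.2): each entry of $\Pi'(y - \Pi\hat\gamma)$ is bounded in absolute value by its penalty weight, with equality (up to sign) on the non-zero coordinates.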


Assumption 1. 

... ,x[rp, is an independent sequence and 

$E[\varepsilon_{i,t} \mid y_{i,t-1}, \ldots, y_{i,1-L}, x_{i,t}, \ldots, x_{i,1}] = 0$ for $i = 1,\ldots,N$, $t = 1,\ldots,T$.


Assumption 1 imposes independence across $i = 1,\ldots,N$ which is standard in the panel data
literature, see e.g. Wooldridge (2010) or Arellano (2003). Note however, that we do not assume
the data to be identically distributed across $i = 1,\ldots,N$. Assumption 1 also implies, by iterated
expectations, that the error terms form a martingale difference sequence with respect to the
filtration generated by the variables in the above conditioning set and thus restricts the degree
of dependence in the error terms across $t$ (in particular, they are uncorrelated).^2 However, it
still allows for considerable dependence over time, as higher moments than the first are not
restricted. Furthermore, the error terms need not be identically distributed over time for any
individual. Note that the increasing number of lags of $y_{i,t}$ also whiten the error terms. We also
note that Assumption 1 does not rule out that the error terms are conditionally heteroskedastic.
In particular, they may be autoregressively conditionally heteroskedastic (ARCH). In panel data
terminology, both lags of $y_{i,t}$ and $x_{i,t}$ are called predetermined or weakly exogenous. Finally,
one can of course also include lags of the $x_{i,t}$ as these are also weakly exogenous.

^2 It can also be verified that $\varepsilon_{i,t}$, $t = 1,\ldots,T$, forms a martingale difference sequence with respect to the natural
filtration for all $i = 1,\ldots,N$. This is because the $\varepsilon_{i,t}$ are (linear) functions of the variables in the conditioning
set in Assumption 1.

In order to introduce the next assumption define the scaled empirical Gram matrix

$$\Psi_N = S^{-1}\Pi'\Pi S^{-1}, \quad \text{where } S := \begin{pmatrix} \sqrt{NT}\, I_p & 0 \\ 0 & \sqrt{T}\, I_N \end{pmatrix}.$$
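The scaling by $S$ can be illustrated numerically. The sketch below, which assumes a balanced panel with rows stacked individual by individual and uses the hypothetical function name `scaled_gram`, forms $\Psi_N$ and makes it easy to check that its lower right block equals $I_N$ (because $D'D = T I_N$) and that its rank cannot exceed $NT$.

```python
import numpy as np

def scaled_gram(Z, D):
    """Psi_N = S^{-1} Pi' Pi S^{-1} with S = diag(sqrt(NT) I_p, sqrt(T) I_N)."""
    NT, p = Z.shape
    N = D.shape[1]
    T = NT // N                                   # balanced panel (assumption)
    Pi = np.hstack([Z, D])
    s = np.concatenate([np.full(p, np.sqrt(NT)), np.full(N, np.sqrt(T))])
    return (Pi.T @ Pi) / np.outer(s, s)           # entrywise division by s_k * s_l

# Example with p + N = 8 > NT = 6, so that Psi_N cannot have full rank.
rng = np.random.default_rng(0)
N, T, p = 3, 2, 5
Z = rng.standard_normal((N * T, p))
D = np.kron(np.eye(N), np.ones((T, 1)))
Psi_N = scaled_gram(Z, D)
rank = np.linalg.matrix_rank(Psi_N)
```

In this example `rank` is at most $NT = 6$ while $\Psi_N$ is $8 \times 8$, which is the singularity that the compatibility type condition is designed to work around.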


When $p + N > NT$, $\Psi_N$ is singular. However, to conduct inference it suffices that a compatibility
type condition tailored to the panel data structure is satisfied. To be precise, define for integers
$r_1 \in \{1,\ldots,p\}$ and $r_2 \in \{1,\ldots,N\}$, with $R := R_1 \cup R_2$,^3

$$\kappa^2(A, r_1, r_2) := \min_{\substack{R_1 \subseteq \{1,\ldots,p\},\ |R_1| \le r_1 \\ R_2 \subseteq \{1,\ldots,N\},\ |R_2| \le r_2}} \ \min_{\substack{\delta \in \mathbb{R}^{p+N} \setminus \{0\} \\ \|\delta_{R^c}\|_{\ell_1} \le 4\|\delta_R\|_{\ell_1}}} \frac{\delta' A \delta}{\|\delta_R\|^2},$$

which is reminiscent of the restricted eigenvalue condition in Bickel et al. (2009). We will need
$\kappa^2(\Psi_N, r_1, r_2)$ to be bounded away from zero for $r_1 = s_1$ and $r_2$ being a sequence made precise
in Appendix A depending on the degree of weak sparsity of the fixed effects. To bound
$\kappa^2(\Psi_N, s_1, r_2)$ away from zero consider $\kappa^2(\Psi, s_1, r_2)$ where^4


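$$\Psi := \begin{pmatrix} \Psi_Z & 0 \\ 0 & I_N \end{pmatrix} = \begin{pmatrix} E\left[\frac{1}{NT} Z'Z\right] & 0 \\ 0 & I_N \end{pmatrix}.$$

We will see that in order for $\kappa^2(\Psi_N, s_1, r_2)$ to be bounded away from zero it suffices that
$\kappa^2(\Psi, s_1, r_2)$ is bounded away from zero and $\Psi_N$ being close to $\Psi$ in an appropriate sense.
Writing $\delta = (\delta_1', \delta_2')'$ where $\delta_1 \in \mathbb{R}^p$ and $\delta_2 \in \mathbb{R}^N$, note that by the block diagonal structure of
$\Psi$

$$\frac{\delta'\Psi\delta}{\|\delta_R\|^2} = \frac{\delta_1'\Psi_Z\delta_1 + \delta_2'\delta_2}{\delta_{1,R_1}'\delta_{1,R_1} + \delta_{2,R_2}'\delta_{2,R_2}} \ge \frac{\delta_1'\Psi_Z\delta_1 + \delta_{2,R_2}'\delta_{2,R_2}}{\delta_{1,R_1}'\delta_{1,R_1} + \delta_{2,R_2}'\delta_{2,R_2}} \ge \frac{\delta_1'\Psi_Z\delta_1}{\delta_{1,R_1}'\delta_{1,R_1}} \wedge 1.$$

The above estimates are useful as they show that we really only have to consider minimization
over the upper left submatrix in the definition of $\kappa^2(\Psi, s_1, r_2)$. To be precise,

$$\kappa^2(\Psi, s_1, r_2) \ge 1 \wedge \min \frac{\delta_1'\Psi_Z\delta_1}{\delta_{1,R_1}'\delta_{1,R_1}} =: \kappa_2^2(\Psi_Z, s_1). \qquad (2.3)$$

Thus, $\kappa_2^2(\Psi_Z, s_1)$ is a uniform lower bound for $\kappa^2(\Psi, s_1, r_2)$ and in order for $\kappa^2(\Psi, s_1, r_2)$ to be
bounded away from zero it suffices to assume that

Assumption 2. $\kappa_2^2(\Psi_Z, s_1)$ is uniformly bounded away from zero.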


Assumption 2 is rather innocent as it is trivially satisfied when $\Psi_Z$ is positive definite.
Since $\Psi_Z$ is the population second moment matrix of $z_{i,t}$, positive definiteness
is typically imposed. Compatibility type conditions are standard in the literature and
various versions and their interrelationship have been investigated in van de Geer et al. (2009).


Assumption 3. There exist positive constants $C$ and $K$ such that

(a) $\varepsilon_{i,t}$ are uniformly subgaussian; that is, $P(|\varepsilon_{i,t}| \ge \epsilon) \le \frac{1}{2} K e^{-C\epsilon^2}$ for every $\epsilon \ge 0$, $i = 1,\ldots,N$
and $t = 1,\ldots,T$.

(b) $z_{i,t,l}$ are uniformly subgaussian; that is, $P(|z_{i,t,l}| \ge \epsilon) \le \frac{1}{2} K e^{-C\epsilon^2}$ for every $\epsilon \ge 0$, $i =
1,\ldots,N$, $t = 1,\ldots,T$ and $l = 1,\ldots,p$.

^3 Here $R_1 \cup R_2$ is understood as $R_1 \cup (R_2 + p)$ where the addition is elementwise.
^4 $\Psi$ actually also depends on $N$ and $T$ but for brevity we are silent about this.


In the context of the plain static regression model it is common practice to assume the error
terms as well as the covariates to be subgaussian. However, this assumption is not as innocent
in the context of the dynamic panel data model (1.1) as $y_{i,t}$ is generated by the model and
its properties are thus completely determined by those of $x_{i,t}$, $\varepsilon_{i,t}$ as well as the parameters of
the model. Lemma 2 in Appendix A shows that $y_{i,t}$ is subgaussian if $x_{i,t}$ and $\varepsilon_{i,t}$ satisfy this
property and the parameters are well-behaved. In particular, a wide class of (causal) stationary
processes are included. Note also, that Assumption 3 imposes subgaussianity of the initial values
$y_{i,0}, \ldots, y_{i,1-L}$ for all $i = 1,\ldots,N$. Caner and Kock (2014) have derived results similar to ours
in a cross sectional setting without the subgaussianity assumption. However, the dimension of
their model cannot increase as fast as here.


2.4 The Oracle Inequalities 

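With the above assumptions in place we are ready to state our first result. Defining $\mathcal{F}(s_1, \nu, E) :=
\{\alpha \in \mathbb{R}^p : \|\alpha\|_{\ell_0} \le s_1\} \times \{\eta \in \mathbb{R}^N : \sum_{i=1}^{N} |\eta_i|^{\nu} \le E\}$ one has

Theorem 1 (Oracle inequalities). Let Assumptions 1-3 hold. Then, choosing $\lambda_N$ =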
Y^4MA^r(log(p V N))^ for some M > 0, the following inequalities are valid with probability 
at least 


-.2. i/3n 


1 - - A{p^ + pN) exp | A^/ si + E | 

for positive constants A and B and si + E < ViV, 


NT 




|n(7-7)|r< 


kUNT^ V 


-t20 


Xn 


y/NNT 


E 


WntJ 


1 -u 


|d — a||i < 


, nn\ J_ p ( \ 

4NT J ./N [./NT) 


l-u 


Vn \VntJ 


li < 


120AAfSi ^ ^ 


IVnt V 




Moreover, the above bounds are valid uniformly over $\mathcal{F}(s_1, \nu, E)$.

Theorem 1 provides oracle inequalities for the prediction error as well as the estimation
error of the parameter vectors. While these bounds are of independent interest, we primarily
use them as means towards our ultimate end of conducting (joint) inference on $\alpha$ and $\eta$. We
stress that the bounds in Theorem 1 are finite sample bounds; they hold for any fixed values
of $N$ and $T$. The novel feature of our oracle inequalities is that $E$, the "size" of $\eta$, is allowed
to grow even when we want the upper bound of $\|\hat\eta - \eta\|_{\ell_1}$ to go to zero. The special case of exact
sparsity of $\eta$ corresponds to $\nu = 0$ and $E$ being the sparsity index, say $s_2$, of $\eta$.

We also note that the oracle inequalities are not obtained in an entirely standard manner
as the mixture of exact and weak sparsity in dynamic panel data models calls for a different
proof technique which yields the upper bounds as solutions to certain quadratic equations.
Furthermore, we remark that in analogy to oracle inequalities in the plain linear regression
model the number of covariates in $x_{i,t}$ ($p_x$) may increase at an exponential rate in $NT$ without
preventing the right hand sides of the oracle inequalities from being small. Finally, we do not assume
independence across $t = 1,\ldots,T$ for any individual thus altering the standard probabilistic
analysis as well. Instead we use concentration inequalities for martingales to obtain bounds
almost as sharp as in the completely independent case. If one restricts the dependence structure
of the observations for every $i = 1,\ldots,N$ to be, e.g., strongly mixing then one can use concentration
inequalities for mixing processes such as in Merlevède et al. (2011). Restricting the dependence
structure this way will allow $s_1$ and $E$ to increase faster. The focus on the $\ell_1$-norm in the oracle
inequalities for $\alpha$ and $\eta$ is due to the fact that an upper bound in this norm will be particularly
useful when developing our uniformly valid inference procedure in the following sections.


3 Inference 

In this section we show how to conduct inference on $\gamma$ and first discuss how desparsification as
proposed in van de Geer et al. (2014) works in our context.


3.1 The Desparsified Lasso Estimator $\tilde\gamma$

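First, observe that $L(\gamma)$ in (2.2) is convex in $\gamma$ and in order for $\hat\gamma$ to be a minimiser of $L$, 0 must
belong to the subdifferential of $L(\gamma)$ at $\hat\gamma$, i.e.

$$0 \in \partial L(\hat\gamma) = \begin{pmatrix} -2Z'(y - \Pi\hat\gamma) + 2\lambda_N\kappa_1 \\ -2D'(y - \Pi\hat\gamma) + 2\frac{\lambda_N}{\sqrt{N}}\kappa_2 \end{pmatrix},$$

where $\kappa_1$ and $\kappa_2$ are $p \times 1$ and $N \times 1$ vectors, respectively, such that $\kappa_{1j} \in [-1, 1]$ with $\kappa_{1j} =
\mathrm{sgn}(\hat\alpha_j)$ if $\hat\alpha_j \neq 0$ for $j = 1,\ldots,p$. Similarly, $\kappa_{2i} \in [-1, 1]$ with $\kappa_{2i} = \mathrm{sgn}(\hat\eta_i)$ if $\hat\eta_i \neq 0$ for
$i = 1,\ldots,N$. Hence,

$$-\Pi'(y - \Pi\hat\gamma) + \begin{pmatrix} \lambda_N\kappa_1 \\ \frac{\lambda_N}{\sqrt{N}}\kappa_2 \end{pmatrix} = 0. \qquad (3.1)$$

Using that $y = \Pi\gamma + \varepsilon$ and multiplying by $S^{-1}$ from the left yields

$$\Psi_N S(\hat\gamma - \gamma) + S^{-1}\begin{pmatrix} \lambda_N\kappa_1 \\ \frac{\lambda_N}{\sqrt{N}}\kappa_2 \end{pmatrix} = S^{-1}\Pi'\varepsilon.$$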


In order to derive the limiting distribution of $S(\hat\gamma - \gamma)$ one would usually proceed by isolating
$S(\hat\gamma - \gamma)$ which implies inverting $\Psi_N$. However, when $p + N > NT$, $\Psi_N$ is not invertible. The
idea of van de Geer et al. (2014) and Javanmard and Montanari (2013) is to circumvent this
problem by using an approximate inverse of $\Psi_N$ and controlling the asymptotic approximation
error. Suppose that a matrix $\hat\Theta$ is a reasonable approximation to the inverse of $\Psi_N$. We shall
explicitly construct $\hat\Theta$ in the next section. Then we may write

$$\hat\gamma = \gamma - S^{-1}\hat\Theta S^{-1}\begin{pmatrix} \lambda_N\kappa_1 \\ \frac{\lambda_N}{\sqrt{N}}\kappa_2 \end{pmatrix} + S^{-1}\hat\Theta S^{-1}\Pi'\varepsilon - S^{-1}\Delta,$$

where $\Delta := (\hat\Theta\Psi_N - I_{p+N})S(\hat\gamma - \gamma)$ is the error resulting from using an approximate inverse $\hat\Theta$ of
$\Psi_N$ as opposed to an exact inverse. The term $S^{-1}\hat\Theta S^{-1}\begin{pmatrix} \lambda_N\kappa_1 \\ \frac{\lambda_N}{\sqrt{N}}\kappa_2 \end{pmatrix}$ in the above display is the
bias incurred by $\hat\gamma$ due to shrinkage of the parameters in (2.2). As this bias term is known one
may add it back to $\hat\gamma$ in order to define the debiased estimator

$$\tilde\gamma = \hat\gamma + S^{-1}\hat\Theta S^{-1}\begin{pmatrix} \lambda_N\kappa_1 \\ \frac{\lambda_N}{\sqrt{N}}\kappa_2 \end{pmatrix} = \hat\gamma + S^{-1}\hat\Theta S^{-1}\Pi'(y - \Pi\hat\gamma) = \gamma + S^{-1}\hat\Theta S^{-1}\Pi'\varepsilon - S^{-1}\Delta,$$

where the second equality uses (3.1).
The new estimator $\tilde\gamma$ is no longer sparse as it adds a bias correction term to the sparse
Lasso estimator $\hat\gamma$. Therefore, we will also refer to it as the desparsified Lasso estimator in the
dynamic panel context.

Consequently,

$$S(\tilde\gamma - \gamma) = \hat\Theta S^{-1}\Pi'\varepsilon - \Delta. \qquad (3.2)$$

For any $(p + N) \times 1$ vector $\rho$ with $\|\rho\| = 1$ we shall study the asymptotic behaviour of

$$\rho' S(\tilde\gamma - \gamma) = \rho'\hat\Theta S^{-1}\Pi'\varepsilon - \rho'\Delta. \qquad (3.3)$$

A central limit theorem for $\rho'\hat\Theta S^{-1}\Pi'\varepsilon$ as well as asymptotic negligibility of $\rho'\Delta$ will yield
asymptotically gaussian inference. Furthermore, we shall provide a uniformly consistent estimator of
the asymptotic variance of $\rho'\hat\Theta S^{-1}\Pi'\varepsilon$ even in the presence of conditional heteroskedasticity. A
leading special case of (3.3) is when one is only interested in the asymptotic distribution of $\tilde\gamma_j$
corresponding to $\rho = e_j$ being the $j$th basis vector of $\mathbb{R}^{p+N}$. In general, we will be interested in
the asymptotic distribution of a subset $H \subseteq \{1,\ldots,p+N\}$ of the indices of $\gamma$ with cardinality
$h$ and shall show that asymptotically honest (uniformly valid) gaussian inference is possible in
the presence of heteroskedasticity even for $h \to \infty$ and $H$ simultaneously involving elements of
$\alpha$ and $\eta$.
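The bias correction step can be sketched in a few lines. By (3.1) the penalty vector equals $\Pi'(y - \Pi\hat\gamma)$, so the debiased estimator can be computed without access to $\kappa_1$ and $\kappa_2$. The function name `desparsify` and the balanced panel layout are assumptions for illustration; the approximate inverse `Theta_hat` is taken as given here.

```python
import numpy as np

def desparsify(y, Z, D, gamma_hat, Theta_hat):
    """Return gamma_hat + S^{-1} Theta_hat S^{-1} Pi'(y - Pi gamma_hat).

    gamma_hat is a Lasso estimate of (alpha', eta')' and Theta_hat is an
    approximate inverse of the scaled Gram matrix Psi_N (both supplied by the caller).
    """
    NT, p = Z.shape
    N = D.shape[1]
    T = NT // N                                          # balanced panel (assumption)
    Pi = np.hstack([Z, D])
    # diagonal of S^{-1}: 1/sqrt(NT) for alpha, 1/sqrt(T) for eta
    s_inv = 1.0 / np.concatenate([np.full(p, np.sqrt(NT)), np.full(N, np.sqrt(T))])
    score = s_inv * (Pi.T @ (y - Pi @ gamma_hat))        # S^{-1} Pi'(y - Pi gamma_hat)
    return gamma_hat + s_inv * (Theta_hat @ score)
```

As a sanity check on the algebra, in a low-dimensional setting where $\Psi_N$ is invertible and `Theta_hat` is set to its exact inverse, the correction reproduces the least squares estimator regardless of the starting value `gamma_hat`.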


3.2 Construction of $\hat\Theta$


As is clear from the discussion above we need a good choice for $\hat\Theta$. In particular we shall show
that

$$\hat\Theta = \begin{pmatrix} \hat\Theta_Z & 0 \\ 0 & I_N \end{pmatrix}$$

works well. Here $\hat\Theta_Z$ will be constructed using nodewise regressions as in van de Geer et al.
(2014) and we show that this is possible even when the rows of $Z$ are not independent and
identically distributed. The construction of $\hat\Theta_Z$ parallels the one in van de Geer et al. (2014)
to a high extent but importantly for our context we do not need the rows of $\Theta_Z$ to be sparse
for the nodewise regressions to work well. We will discuss the importance of this, once we have
properly constructed $\hat\Theta_Z$. First, define


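$$\hat\phi_j = \operatorname*{argmin}_{\delta \in \mathbb{R}^{p-1}} \left\{ \frac{1}{NT}\|z_j - Z_{-j}\delta\|^2 + 2\lambda_{node}\|\delta\|_{\ell_1} \right\}, \quad j = 1,\ldots,p, \qquad (3.4)$$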


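where $z_j$ is the $j$th column of $Z$, $Z_{-j}$ is the $NT \times (p - 1)$ submatrix of $Z$ with $Z$'s $j$th column
removed, and the $(p - 1) \times 1$ vector $\hat\phi_j = \{\hat\phi_{j,l} : l = 1,\ldots,p, \ l \neq j\}$. Thus, $\hat\phi_j$ is the Lasso
estimator resulting from regressing $z_j$ on $Z_{-j}$. Next, define

$$\hat C = \begin{pmatrix} 1 & -\hat\phi_{1,2} & \cdots & -\hat\phi_{1,p} \\ -\hat\phi_{2,1} & 1 & \cdots & -\hat\phi_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ -\hat\phi_{p,1} & -\hat\phi_{p,2} & \cdots & 1 \end{pmatrix}$$

and $\hat\tau_j^2 = \frac{1}{NT}\|z_j - Z_{-j}\hat\phi_j\|^2 + \lambda_{node}\|\hat\phi_j\|_{\ell_1}$ as well as $\hat T^2 = \mathrm{diag}(\hat\tau_1^2, \ldots, \hat\tau_p^2)$. Finally, we set
$\hat\Theta_Z = \hat T^{-2}\hat C$. Let $\hat C_j$ denote the $j$th row of $\hat C$ and let $\hat\Theta_{Z,j}$ denote the $j$th row of $\hat\Theta_Z$, both
written as $p \times 1$ vectors. Then, $\hat\Theta_{Z,j} = \hat C_j / \hat\tau_j^2$. For any $j = 1,\ldots,p$, the KKT conditions for a
minimum in (3.4) are

$$-\frac{1}{NT} Z_{-j}'(z_j - Z_{-j}\hat\phi_j) + \lambda_{node} w_j = 0, \qquad (3.5)$$

where $w_j$ is the subdifferential of $\|x\|_{\ell_1}$ evaluated at $\hat\phi_j$. Using this, the definition of $\hat\tau_j^2$, and
$\hat\phi_j' w_j = \|\hat\phi_j\|_{\ell_1}$ yields

$$\hat\tau_j^2 = \frac{1}{NT}\|z_j - Z_{-j}\hat\phi_j\|^2 + \lambda_{node}\|\hat\phi_j\|_{\ell_1} = \frac{1}{NT} z_j'(z_j - Z_{-j}\hat\phi_j). \qquad (3.6)$$

Thus, by the definition of $\hat\Theta_{Z,j}$, and as $\hat\tau_j^2$ is bounded away from zero (we shall later argue
rigorously for this)

$$\frac{1}{NT} z_j' Z \hat\Theta_{Z,j} = 1. \qquad (3.7)$$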

Furthermore, the KKT conditions (3.5) can also be written as

$$\frac{1}{NT} Z_{-j}'(z_j - Z_{-j}\hat\phi_j) = \lambda_{node} w_j, \qquad (3.8)$$

which implies $\frac{1}{NT} Z_{-j}' Z \hat\Theta_{Z,j} = \lambda_{node} w_j / \hat\tau_j^2$. Combining with (3.7) yields

$$\left\| \frac{1}{NT} Z'Z\hat\Theta_{Z,j} - e_j \right\|_{\ell_\infty} \le \frac{\lambda_{node}}{\hat\tau_j^2}, \qquad (3.9)$$

which together with an oracle inequality for $\|\hat\gamma - \gamma\|_{\ell_1}$ provides an upper bound on the $j$th entry
of $\Delta$ in (3.3). In other words, (3.9) will be used to show the required asymptotic negligibility
of $\rho'\Delta$ in (3.3) by arguments made rigorous in the appendix.
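The nodewise construction of $\hat\Theta_Z$ can be sketched as below. The inner Lasso solver is a simple cyclic coordinate descent written for this illustration (any Lasso routine minimising the objective in (3.4) would do); the function names `lasso_cd` and `nodewise_theta` are hypothetical, and no column of $Z$ is assumed to be identically zero.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=2000, tol=1e-12):
    """Minimise (1/n)||y - X d||^2 + 2*lam*||d||_1 by cyclic coordinate descent."""
    n, k = X.shape
    col_sq = np.sum(X ** 2, axis=0) / n
    d = np.zeros(k)
    resid = np.asarray(y, dtype=float).copy()
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(k):
            rho = X[:, j] @ resid / n + col_sq[j] * d[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            if new != d[j]:
                resid -= X[:, j] * (new - d[j])
                max_change = max(max_change, abs(new - d[j]))
                d[j] = new
        if max_change < tol:
            break
    return d

def nodewise_theta(Z, lam_node):
    """Return Theta_Z_hat = T_hat^{-2} C_hat and the vector of tau_hat_j^2."""
    n, p = Z.shape
    C = np.eye(p)
    tau2 = np.empty(p)
    for j in range(p):
        others = [l for l in range(p) if l != j]
        phi = lasso_cd(Z[:, others], Z[:, j], lam_node)      # nodewise Lasso (3.4)
        C[j, others] = -phi                                  # j-th row of C_hat
        resid = Z[:, j] - Z[:, others] @ phi
        tau2[j] = resid @ resid / n + lam_node * np.sum(np.abs(phi))
    return C / tau2[:, None], tau2                           # row j divided by tau_j^2
```

With `Theta, tau2 = nodewise_theta(Z, lam_node)`, column $j$ of `(Z.T @ Z / n) @ Theta.T` should have a unit $j$th entry as in (3.7) and remaining entries bounded by `lam_node / tau2[j]` in absolute value as in (3.9), up to the numerical tolerance of the solver.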


3.3 Asymptotic Properties of the Approximate Inverse 

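In order to show that $\rho'\hat\Theta S^{-1}\Pi'\varepsilon$ is asymptotically gaussian one needs to understand the limiting
behaviour of $\hat\Theta$ constructed above. We show that $\hat\Theta$ is close to

$$\Theta := \begin{pmatrix} \Theta_Z & 0 \\ 0 & I_N \end{pmatrix} = \begin{pmatrix} \Psi_Z^{-1} & 0 \\ 0 & I_N \end{pmatrix} = \Psi^{-1}$$

in an appropriate sense. To this end, note that by Yuan (2010)

$$\Theta_{Z,j,j} = \left[\Psi_{Z,j,j} - \Psi_{Z,j,-j}\Psi_{Z,-j,-j}^{-1}\Psi_{Z,-j,j}\right]^{-1} \quad \text{and} \quad \Theta_{Z,j,-j} = -\Theta_{Z,j,j}\Psi_{Z,j,-j}\Psi_{Z,-j,-j}^{-1}, \qquad (3.10)$$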


where $\Theta_{Z,j,j}$ is the jth diagonal entry of $\Theta_Z$, $\Theta_{Z,j,-j}$ is the 1 x (p − 1) vector obtained by
removing the jth entry of the jth row of $\Theta_Z$, $\Psi_{Z,-j,-j}$ is the submatrix of $\Psi_Z$ with the jth row
and column removed, $\Psi_{Z,j,-j}$ is the jth row of $\Psi_Z$ with its jth entry removed, and $\Psi_{Z,-j,j}$ is the jth
column of $\Psi_Z$ with its jth entry removed. Next, let $z_{i,t,j}$ be the jth element of $z_{i,t}$ and $z_{i,t,-j}$
be all elements except the jth. Define the (p − 1) x 1 vector


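\[
\phi_j := \arg\min_{\delta}\; E\left[\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T \left(z_{i,t,j} - z_{i,t,-j}'\delta\right)^2\right]
\]

such that

\[
\phi_j = \left(E\left[\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t,-j}z_{i,t,-j}'\right]\right)^{-1}
E\left[\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t,-j}z_{i,t,j}\right]
= \Psi_{Z,-j,-j}^{-1}\Psi_{Z,-j,j}. \tag{3.11}
\]

Therefore, $\Theta_{Z,j,-j} = -\Theta_{Z,j,j}\phi_j'$, showing that $\Theta_{Z,j,-j}$ and $\phi_j'$ only differ by a multiplicative
constant. In particular, the jth row of $\Theta_Z$ is exactly sparse if and only if $\phi_j$ is exactly sparse. More
generally, we shall exploit below that weak sparsity of one implies weak sparsity of the other.
Furthermore, defining $\zeta_{j,i,t} := z_{i,t,j} - z_{i,t,-j}'\phi_j$ we may write

\[
z_{i,t,j} = z_{i,t,-j}'\phi_j + \zeta_{j,i,t} \quad\text{for } i = 1,\dots,N,\; t = 1,\dots,T,
\]

where by the definition of $\phi_j$

\[
E\left[\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t,-j}\zeta_{j,i,t}\right] = 0. \tag{3.12}
\]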
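The link between a row of the inverse second moment matrix and the population regression coefficients in (3.10) and (3.11) can be verified on a small example. The following sketch assumes an arbitrary, purely illustrative 3 x 3 positive definite matrix in the role of $\Psi_Z$; the helper names are hypothetical and the Gauss-Jordan inverse is written only for this check.

```python
def inverse(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if r == c else 0.0 for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Illustrative stand-in for Psi_Z (symmetric, diagonally dominant, hence positive definite).
psi = [[2.0, 0.6, 0.2],
       [0.6, 1.5, 0.4],
       [0.2, 0.4, 1.0]]
j = 0
rest = [k for k in range(len(psi)) if k != j]

theta = inverse(psi)                                    # stand-in for Theta_Z
psi_rest = [[psi[a][b] for b in rest] for a in rest]    # Psi_{Z,-j,-j}
psi_rest_j = [psi[a][j] for a in rest]                  # Psi_{Z,-j,j}
inv_rest = inverse(psi_rest)

# phi_j = Psi_{Z,-j,-j}^{-1} Psi_{Z,-j,j}, as in (3.11)
phi = [sum(inv_rest[a][b] * psi_rest_j[b] for b in range(len(rest)))
       for a in range(len(rest))]

# Schur complement Psi_{Z,j,j} - Psi_{Z,j,-j} phi_j, whose inverse is Theta_{Z,j,j} by (3.10)
schur = psi[j][j] - sum(psi[j][k] * phi[i] for i, k in enumerate(rest))

print("Theta_jj:", theta[j][j], " 1/Schur:", 1.0 / schur)
print("Theta_{j,-j}:", [theta[j][k] for k in rest])
print("-Theta_jj * phi_j:", [-theta[j][j] * v for v in phi])
```

The off-diagonal part of row j is a scalar multiple of $\phi_j'$, so the two vectors share the same zero pattern.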


Thus, in light of Theorem 1 it is sensible that the Lasso estimator $\hat\phi_j$ defined in (3.4) is close
to the population regression coefficients $\phi_j$ (we shall make this more formal in Appendix A).
Next, defining

\[
\tau_j^2 := E\left[\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T \left(z_{i,t,j} - z_{i,t,-j}'\phi_j\right)^2\right]
= \Psi_{Z,j,j} - \Psi_{Z,j,-j}\Psi_{Z,-j,-j}^{-1}\Psi_{Z,-j,j} = \frac{1}{\Theta_{Z,j,j}},
\]

observe $\Theta_{Z,j,-j} = -\phi_j'/\tau_j^2$. Thus, we can write $\Theta_Z = T^{-2}C$ where $T^2 = \mathrm{diag}(\tau_1^2, \dots, \tau_p^2)$ and C is
defined similarly to $\hat C$ but with $\phi_j$ replacing $\hat\phi_j$ for j = 1,...,p. Finally, let $\Theta_{Z,j}$ denote the jth
row of $\Theta_Z$ written as a column vector. In Lemma 1 below we will see that $\hat\phi_j$ and $\hat\tau_j^2$ are close
to $\phi_j$ and $\tau_j^2$, respectively, such that $\hat\Theta_{Z,j}$ is close to $\Theta_{Z,j}$, which is the desired control of $\hat\Theta_{Z,j}$.
Write $\rho = (\rho_1', \rho_2')'$ with $\|\rho\| = 1$, where $\rho_1 \in \mathbb{R}^p$ and $\rho_2 \in \mathbb{R}^N$. Hence define

\[
H = H_1 \cup (H_2 + p) := \{j : \rho_{1j} \neq 0\} \cup (\{i : \rho_{2i} \neq 0\} + p),
\]

with $|H_1| = h_1$, $|H_2| = h_2$ and $|H| = h = h_1 + h_2$. In dynamic panel data
models it may not be reasonable to assume that the rows of the inverse second moment matrix
$\Psi_Z^{-1} = \Theta_Z$, i.e. the $\Theta_{Z,j}$, are sparse. Paralleling Section 2.2 we shall instead assume that the $\Theta_{Z,j}$
are weakly sparse and assume that


k=i i^j 


(3.13) 


for some $0 < \vartheta < 1$ and $G_j > 0$. Define $G := \max_{j \in H_1} G_j$.

Assumption 4. 

(a) mineval($\Psi_Z$) is uniformly bounded away from zero and maxeval($\Psi_Z$) is uniformly bounded
from above.

(b) GX\-1 = 0{1). 

(c) There exist positive constants G and K such that the $\zeta_{j,i,t}$ are uniformly subgaussian; that is,

T(|Cj,i,t| > e) < for every e > Q, i = I,... ,N, t = 1,... ,T and j = 1 ,... ,p. 

Assumption 4(a) is standard and strengthens Assumption 2 slightly. Recall that the population
matrix $\Psi_Z$ can have full rank even when its empirical counterpart is rank
deficient, which it is when p + N > NT. Note that Assumption 4(a) implies that $\tau_j^2$ is uniformly
bounded away from zero as $\tau_j^2 = 1/\Theta_{Z,j,j} \ge 1/\mathrm{maxeval}(\Theta_Z) = \mathrm{mineval}(\Psi_Z)$. Similarly,
$\tau_j^2 \le \mathrm{maxeval}(\Psi_Z)$, implying that $\tau_j^2$ is bounded in (3.13). Therefore, weak sparsity of $\phi_j$
translates into weak sparsity of the rows of $\Theta_Z$. Notice that we generalize the cross sectional
results of van de Geer et al. (2014) by not imposing the inverse covariance (second moment
matrix) of $z_{i,t}$ to have sparse rows. When $z_{i,t}$ is gaussian, exact sparsity of $\Theta_Z$ is related to
the notion of conditional independence: the (j, k)th entry of $\Theta_Z$ being zero is equivalent to $z_{i,t,j}$
being independent of $z_{i,t,k}$ conditional on the remaining variables in $z_{i,t}$. This is hard to
justify in dynamic panel data models. First, it does not sound reasonable for the $x_{i,t}$'s to be mostly
conditionally independent given the lagged variables. Second, adjacent lagged variables $y_{i,t-l}$
and $y_{i,t-(l+1)}$ are not independent even after conditioning on all the other
variables in $z_{i,t}$. In conclusion, it is important to relax the exact sparsity assumption on the
rows of $\Theta_Z$ in the context of dynamic panel data models.


h implies in 


Part (b) restricts the rate of growth of G. As we shall choose Xnode - 

particular that G = O [{N/log^{p))^~). Part (c) imposes subgaussianity on the error terms 
from the nodewise regressions. 


Lemma 1. Let Assumptions [21 0 and hold. Define Xnode = \/l6M{logp)^/N for some 
M > 0. Then, for M sufficiently large, 


I -'‘2 2 I 

max r.- — r,- 

j&Hi ^ ^ 

1 

max —jr 

fj 

1 1 

jeH, rf rf 


max 

jeHi 




max||02 7 — 0z 7 
j&Hi" 


max 07 7 h 


Op ( 


Op(l) 


(iogpf 

N 


Op ( 


(logp)" 

N 


OpiG 


(logpfi 


N 


Op ( 


Op ( 


(logp)" 

N 

(iogpf 

N 


(3.14) 

(3.15) 

(3.16) 

(3.17) 

(3.18) 

(3.19) 


Lemma 1 is used as a stepping stone towards establishing asymptotically gaussian inference
as it provides the rate at which $\hat\Theta_Z$ approaches $\Theta_Z$ uniformly over $H_1$. Note that for
$H_1 = \{1,\dots,p\}$, (3.17) provides an upper bound on the induced $\ell_\infty$-distance between $\hat\Theta_Z$ and
$\Theta_Z$. However, we only need to control this distance for those indices corresponding to the
parameters we seek the joint limiting distribution of. On the other hand, it should be stressed
that the uniformity over $H_1$ of the above results is crucial in establishing the limiting gaussian
inference and providing a feasible estimator of the covariance matrix of the parameter estimates.
In case one is only interested in one entry of $\gamma$, $H_1$ reduces to a singleton if this entry is in $\alpha$.
If this entry is in $\eta$, Lemma 1 is actually superfluous as the lower right hand corners of $\hat\Theta$ and
$\Theta$ are identical.


3.4 The Asymptotic Distribution of 7 

In this section we formalise the discussion in Section [Q as Theorem |5J To this end, define 

( ^[Z’efiZ/{NT)] E [Zfifi D / (VnT)] ^ \ 

^ it* j g [D'ee'Z/ {VnT)] E [D'efiD/T] J Sg.jv J 


Sne = 
and note that 
Si, AT = E 


N N 




7=1 j = l 


Af NT 

7 = 1 7=1 t=l 
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where the second and third equality both follow from Assumption[TJ Likewise, = y IE [diEje'd'] 
T Eili TI=i IE \£ltdi,tdlt = diag(^ ELi T TI=i IE[eir,t]), where d' is the ith T x iV 


block of D, and di^t is a iV x 1 zero vector with the ith entry replaced by 1. In the same 


manner, 


^2,N = 


Ej=i IE [zi£i£id'i] 


Vnt 




dtZi,td'i,t 


In words, S 2 at is a 


p X N matrix with its ith column being Et=i Finally, motivated by the above, 

define the feasible sample counterpart of Sne as 


Sne = 


-‘1,N 


S'. 


2,N 



NT Z^i=l Z^t=l ^ 


yjvT 




i.t 


Vnt 


~2 j / J_ stT j If 

Z^i=l 2^t=l T 2-^i=l 2^t=l 


where $\hat\varepsilon_{i,t} := y_{i,t} - z_{i,t}'\hat\alpha - \hat\eta_i$. One could also consider constructing $\hat\varepsilon_{i,t}$ based on the desparsified
estimates. However, this would require running the nodewise regressions for all variables, and
not only those pertaining to the coefficients in the hypothesis being tested, resulting in a much
more computationally demanding procedure. The following assumptions are needed to establish
the validity of asymptotically gaussian inference for our procedure.


Assumption 5. Let p := pV N V T and assume 


(a) 


(di Vh2l{/ll 7^0})2 g2 


{iospf 


1 - 1 ? 


N 


N 


(logp)" (log(Avr)) 3 l{L 2 / 0 } 

- = o(l), - - - =o(l)- 


(b) 


(c) 


hjG^ 


(logp)^ 


N 


-1? 


V 




(logp)' 


NT 


= 0 ( 1 ). 


{hi V / 12 ) 


G 


(logp)^ 

N 


-■d/2 


V (logp)^) l{hi 7 ^ 0} V l{/i 2 7 ^ 0} sf V E' 


^2 X/ p2 ( (log(pVW))^ 
T 


3 \ -I'- 


{log pY 


N 


= 0 ( 1 ). 


(d) mineval($\Sigma_{\Pi\varepsilon}$) is uniformly bounded away from zero and maxeval($\Sigma_{1,N}$) is uniformly bounded
from above.

Assumption 5 is slightly stronger than what we actually need in order to prove Theorem 2
but it is less cluttered in terms of notation. Assumption 5 restricts the rate at which p, T, $s_1$,
E, G, $h_1$ and $h_2$ are allowed to increase as none of these are assumed to be bounded. First, note
that $p = L + p_x$ only enters through its logarithm. Thus, we can allow for very high-dimensional
models. Furthermore, $h_1$ as well as $h_2$ are allowed to increase with the sample size such that
hypotheses of an increasing dimension involving $\alpha$ and $\eta$ simultaneously can be tested. In the
classical setting where one is only interested in testing hypotheses on $\alpha$ one has that $h_2 = 0$
such that Assumption 5 simplifies. The case of hypotheses only involving the fixed effects $\eta$
corresponds to $h_1 = 0$ and again the assumptions simplify. We also note that Assumption 5
requires G and hi necessarily to be o{N^~), si necessarily to be o(A^/^), E necessarily to 
be o(A/A'(log(p V N)Y '^and /12 necessarily be o(r^/^). The restrictions on hi and / 12 , 
i.e. the number of common coefficients and fixed effects involved in the hypothesis (no fixed
effects need to be zero), thus clearly encompass the classical setting where one tests only a fixed
number of parameters ($h_1$ and $h_2$ fixed). Assumption 5 is satisfied if, for example,
p = N,T = !/ = '& = 0.5, Si = N^^^,E = and G = A^/'^. Thus, while we allow these 

quantities to diverge, the rate at which they do so must be under control. 
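Theorem 2. Let Assumptions 1-5 be satisfied. If, furthermore, $\{\varepsilon_{i,t}\}_{t=1}^T$ is an
independent sequence for all i = 1,...,N, then

\[
\frac{\rho' S(\tilde\gamma - \gamma)}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \xrightarrow{d} N(0,1), \tag{3.20}
\]

where $\rho = (\rho_1', \rho_2')'$ is a (p + N) x 1 vector, with $\|\rho\| = 1$, $\rho_1 \in \mathbb{R}^p$ and $\rho_2 \in \mathbb{R}^N$. Moreover,

\[
\sup_{\gamma \in \mathcal{F}(s_1,\nu,E)} \left|\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho - \rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho\right| = o_p(1). \tag{3.21}
\]

Finally, for every fixed set $H \subset \{1,\dots,N+p\}$ with cardinality h, we have

\[
\left[S_H(\tilde\gamma_H - \gamma_H)\right]' \left(\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\right)_H^{-1} \left[S_H(\tilde\gamma_H - \gamma_H)\right] \xrightarrow{d} \chi^2_h. \tag{3.22}
\]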


Theorem 2 provides sufficient conditions under which our procedure allows for asymptotically
gaussian inference. We stress again that hypotheses involving an increasing number of
parameters can be tested and that the total number of parameters in the model may be much
larger than the sample size. Furthermore, the error terms are allowed to be conditionally
heteroskedastic and we provide a consistent estimator of the asymptotic covariance matrix even
for the case of hypotheses involving an increasing number of parameters. Indeed, this estimator
converges uniformly over $\mathcal{F}(s_1,\nu,E)$ even for high-dimensional covariance matrices, which we
use in Theorem 3 to establish the honesty (uniform validity) over this set of confidence intervals
based on (3.20). van de Geer et al. (2014) have derived similar results in the setting of the
homoskedastic linear cross sectional model for the case of inference on a low-dimensional
parameter. Thus, our results can be seen as an extension to dynamic panel data models. We stress
again that we relax their assumption of the inverse covariance matrix $\Theta_Z$ being exactly
sparse, which is important in dynamic models like ours. Furthermore, relaxing the homoskedasticity
assumption is important as volatility is known to vary over time in dynamic models, see
e.g. Engle (1982), and the conditional volatility often depends on the state of the process.
Theorem 2 is also related to Belloni et al., who consider inference in static panel data
models for a low-dimensional parameter of interest.

The classical setup where one is only interested in inference on $\alpha$ corresponds to $\rho_2 = 0$
such that $\sqrt{NT}\rho_1'(\tilde\alpha - \alpha)$ is asymptotically gaussian with variance equal to the limit of
$\rho_1'\Theta_Z\Sigma_{1,N}\Theta_Z'\rho_1$ (assumed to exist for illustration). If, furthermore, $\varepsilon_{i,t}$ is homoskedastic with
variance $\sigma^2$ and independent of $z_{i,t}$ for all i = 1,...,N and t = 1,...,T, it follows from the
definition of $\Sigma_{1,N}$ that this variance equals the limit of $\sigma^2\rho_1'\Theta_Z\rho_1 = \sigma^2\rho_1'\Psi_Z^{-1}\rho_1$. The leading
special case where one is interested in testing a hypothesis on the jth entry of $\alpha$ corresponds
to $\rho_1 = e_j$. Similar reasoning shows that in the case where one is testing hypotheses involving
fixed effects only, corresponding to $\rho_1 = 0$, one has that $\rho_2'\sqrt{T}(\tilde\eta - \eta)$ is asymptotically gaussian
with variance $\sigma^2$. This simple form of the variance follows from the asymptotic independence of
the components of $\tilde\eta$. Note that the different rates of convergence for $\tilde\alpha$ and $\tilde\eta$ are in accordance
with Theorem 1.

(3.22) is a straightforward consequence of (3.20) and reveals that classical inference can
be carried out in the usual manner. Thus, asymptotically valid $\chi^2$-inference can be performed in
order to test a hypothesis on h parameters simultaneously. Wald tests of general restrictions of
the type $H_0 : g(\gamma) = 0$ (where $g : \mathbb{R}^{p+N} \to \mathbb{R}^h$ is differentiable in an open neighborhood around
$\gamma$ and has derivative matrix of rank h) can now also be constructed in the usual manner, see
e.g. Davidson (2000) Chapter 12, even when p + N > NT, which has hitherto been impossible.

Finally, the independence assumption on $\varepsilon_{i,t}$ across t is needed only if one tests hypotheses
involving $\{\eta_i\}_{i=1}^N$ ($h_2 \neq 0$). Weaker assumptions on the error terms, such as strong mixing, are
possible at the expense of more involved expressions but will not be pursued here.
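The $\chi^2$-inference described around (3.22) amounts to evaluating a quadratic form and comparing it with a $\chi^2_h$ critical value. The sketch below illustrates this step only. The inputs `scaled_diff` (standing in for the scaled difference between the desparsified estimates and their hypothesized values) and `cov` (standing in for the relevant h x h block of the estimated covariance matrix) are made-up numbers, and the function names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def wald_statistic(scaled_diff, cov):
    """Quadratic form d' cov^{-1} d, computed without forming the inverse explicitly."""
    weights = solve(cov, scaled_diff)
    return sum(d * w for d, w in zip(scaled_diff, weights))

# Upper 5% critical values of the chi-square distribution for 1, 2 and 3 degrees of freedom.
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815}

# Made-up inputs for a joint hypothesis on h = 3 parameters.
scaled_diff = [2.1, -0.4, 1.3]
cov = [[1.2, 0.3, 0.0],
       [0.3, 0.9, 0.1],
       [0.0, 0.1, 1.1]]

statistic = wald_statistic(scaled_diff, cov)
reject = statistic > CHI2_95[len(scaled_diff)]
print("Wald statistic:", statistic, " reject at 5%:", reject)
```

Solving the linear system instead of inverting the covariance block is a numerical convenience; the statistic is the same.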
4 Honest Confidence Intervals


In this section we show that the confidence bands based on (3.20) are honest (uniformly valid)
and contract at the optimal rate. The precise result is contained in the following theorem.

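Theorem 3. Let Assumptions 1-5 be satisfied. Then, for all $\rho \in \mathbb{R}^{p+N}$ with $\|\rho\| = 1$,

\[
\sup_{t \in \mathbb{R}}\; \sup_{\gamma \in \mathcal{F}(s_1,\nu,E)} \left| P\left( \frac{\rho' S(\tilde\gamma - \gamma)}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \le t \right) - \Phi(t) \right| = o(1), \tag{4.1}
\]

where $\Phi(\cdot)$ is the CDF of the standard normal distribution. Furthermore, define
$\hat\sigma_{\alpha,j}^2 := [\hat\Theta_Z\hat\Sigma_{1,N}\hat\Theta_Z']_{jj}$ and $\hat\sigma_{\eta,i}^2 := [\hat\Sigma_{3,N}]_{ii}$ for j = 1,...,p and i = 1,...,N, respectively. Then,

\[
\liminf_{N\to\infty}\; \inf_{\gamma \in \mathcal{F}(s_1,\nu,E)} P\left( \alpha_j \in \left[ \tilde\alpha_j - z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}},\; \tilde\alpha_j + z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}} \right] \right) \ge 1 - \delta, \tag{4.2}
\]

\[
\liminf_{N\to\infty}\; \inf_{\gamma \in \mathcal{F}(s_1,\nu,E)} P\left( \eta_i \in \left[ \tilde\eta_i - z_{1-\delta/2}\frac{\hat\sigma_{\eta,i}}{\sqrt{T}},\; \tilde\eta_i + z_{1-\delta/2}\frac{\hat\sigma_{\eta,i}}{\sqrt{T}} \right] \right) \ge 1 - \delta, \tag{4.3}
\]

for j = 1,...,p and i = 1,...,N, respectively, where $z_{1-\delta/2}$ is the 1 − δ/2 percentile of the standard
normal distribution. Finally, letting diam([a, b]) = b − a be the length (which coincides with the
Lebesgue measure of [a, b]) of an interval [a, b] in the real line, we have

\[
\sup_{\gamma \in \mathcal{F}(s_1,\nu,E)} \mathrm{diam}\left( \left[ \tilde\alpha_j - z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}},\; \tilde\alpha_j + z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}} \right] \right) = O_p\left(\frac{1}{\sqrt{NT}}\right), \tag{4.4}
\]

\[
\sup_{\gamma \in \mathcal{F}(s_1,\nu,E)} \mathrm{diam}\left( \left[ \tilde\eta_i - z_{1-\delta/2}\frac{\hat\sigma_{\eta,i}}{\sqrt{T}},\; \tilde\eta_i + z_{1-\delta/2}\frac{\hat\sigma_{\eta,i}}{\sqrt{T}} \right] \right) = O_p\left(\frac{1}{\sqrt{T}}\right), \tag{4.5}
\]

for j = 1,...,p and i = 1,...,N, respectively.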


(4.1) reveals that the convergence to the normal distribution in Theorem 2 is uniform over
$\mathcal{F}(s_1,\nu,E)$. Since the desparsified Lasso is not a sparse estimator, this uniform convergence
does not contradict the work of Leeb and Pötscher (2005). Next, (4.2) is a direct consequence
of (4.1) and reveals that the desparsified Lasso produces confidence bands which are honest
(uniform) over $\mathcal{F}(s_1,\nu,E)$. Honest confidence bands are important in practical applications
of dynamic panel data models as they guarantee the existence of an $N_0$, not depending on
$\gamma \in \mathcal{F}(s_1,\nu,E)$, such that $\left[\tilde\alpha_j - z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}},\; \tilde\alpha_j + z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}}\right]$ covers $\alpha_j$ with probability not
much smaller than 1 − δ. Here the important point is that one and the same $N_0$ guarantees
this coverage, irrespective of the true value of $\gamma \in \mathcal{F}(s_1,\nu,E)$. On the other hand, pointwise
consistent confidence bands only guarantee that

\[
\inf_{\gamma \in \mathcal{F}(s_1,\nu,E)}\; \liminf_{N\to\infty} P\left( \alpha_j \in \left[ \tilde\alpha_j - z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}},\; \tilde\alpha_j + z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}} \right] \right) \ge 1 - \delta,
\]

implying that the value of N needed in order to guarantee a coverage of close to 1 − δ may
depend on the unknown true parameter. Thus, for some parameter values one may have to
sample more data points to achieve the desired coverage than for others, which is unfortunate
as one does not know for which parameters this is the case. An honest confidence set $S_N$ for
$\alpha_j$ can of course trivially be obtained by setting $S_N = \mathbb{R}$. However, this is clearly not very
informative and therefore (4.4) is reassuring as it guarantees that the length of the honest
confidence interval contracts at the optimal rate. In particular, the confidence bands are
uniformly narrow over $\mathcal{F}(s_1,\nu,E)$ in the sense that for any ε > 0 there exists an M > 0 such that

\[
\mathrm{diam}\left( \left[ \tilde\alpha_j - z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}},\; \tilde\alpha_j + z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}} \right] \right) \le \frac{M}{\sqrt{NT}}
\]

for all $\gamma \in \mathcal{F}(s_1,\nu,E)$ with probability at
least 1 − ε. Therefore, our confidence bands are not only honest, they are also very informative
as they contract as fast as possible and this contraction is uniform over $\mathcal{F}(s_1,\nu,E)$. Since the
desparsified Lasso is not a sparse estimator, this fast contraction does not contradict inequality
6 in Theorem 2 of Pötscher (2009), who shows that honest confidence bands based on sparse
estimators must be large.

Similarly to the confidence bands pertaining to $\alpha$, the ones for the fixed effects are also
honest and contract at the optimal rate. Note that this rate is again slower than the one for $\alpha$.
It is also worth remarking that the above inference results are valid without any sort of lower
bound on the non-zero coefficients as inference is not conducted after model selection.
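The two interval constructions differ only in their scaling: $\sqrt{NT}$ for an entry of $\alpha$ and $\sqrt{T}$ for a fixed effect. A minimal sketch of the arithmetic follows, with made-up point estimates and standard deviation estimates (the values 0.9, 0.3 and 1.0 are illustrative only, and the function name is hypothetical).

```python
from math import sqrt
from statistics import NormalDist

def gaussian_interval(estimate, sigma_hat, scale, delta=0.05):
    """Interval estimate -/+ z_{1-delta/2} * sigma_hat / scale."""
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    half_width = z * sigma_hat / scale
    return estimate - half_width, estimate + half_width

N, T = 20, 10  # sample sizes as in the moderate-dimensional experiment

# Entry of alpha: scaled by sqrt(N*T).
alpha_low, alpha_high = gaussian_interval(0.9, 1.0, sqrt(N * T))
# Fixed effect: scaled by sqrt(T) only, so the interval is wider.
eta_low, eta_high = gaussian_interval(0.3, 1.0, sqrt(T))

print("alpha interval length:", alpha_high - alpha_low)
print("eta interval length:  ", eta_high - eta_low)
```

With equal standard deviation estimates, the ratio of the two lengths is $\sqrt{N}$, which reflects the slower rate for the fixed effects.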


5 Monte Carlo 


In this section we investigate the finite sample properties of our estimator by means of
simulations. All calculations are carried out in R using the glmnet package, and the tuning
parameters of the Lasso and of the nodewise regressions ($\lambda_{node}$) are
chosen via BIC by the formula given in (9.4.9) in Davidson (2000). Cross validation was also
considered, but this did not alter the results while being considerably slower. We leave it for
future work to establish theoretical performance guarantees on these procedures in the setting
of high-dimensional dynamic panel data models.

The data generating process is (1.1) and in all experiments $(\alpha_1, \alpha_2, \alpha_3, \alpha_4) = (0.9, 0, 0, -0.3)$
such that the roots of the corresponding lag polynomial lie outside the unit disk. In practice,
one might not know the true lag length and usually specifies a reasonably large number of lags
(to test downwards). To reflect this in our simulations, we always included 5 lags but also
experimented with more than 5 lags. The results were not sensitive to this.

For each i = 1,...,N, the $x_{i,t}$ are generated according to the autoregressive structure

\[
x_{i,t} = a_x x_{i,t-1} + e_{distur,i,t},
\]

where the $e_{distur,i,t}$ are $p_x \times 1$ random disturbance vectors independent across i and t. $a_x$ is an
autoregressive scalar which controls the temporal dependence of $x_{i,t}$. For simplicity, we restrict
$a_x$ to be the same across i. When $a_x = 0$, we have temporal independence across t for $x_{i,t}$.
Since Assumption 1 does not restrict any temporal dependence of $x_{i,t}$, we set $a_x = 0.5$. Our
simulation results are reasonably robust to the choice of $a_x$. The covariance matrix of $e_{distur,i,t}$
is chosen to have a Toeplitz structure with the (i, j)th entry equal to $\rho^{|i-j|}$ with $\rho = 0.75$.

We also experimented with other choices of $\rho$ which did not change the results dramatically.
Furthermore, we also tried to let the covariance matrix of $e_{distur,i,t}$ be block-diagonal. Again,
this did not alter our results.
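The generation of the exogenous regressors for one unit can be sketched as follows. This is an illustration in Python rather than the authors' R code, and it rests on two assumptions that should be read as such: a first-order recursion $x_{i,t} = a_x x_{i,t-1} + e_{distur,i,t}$ started at zero, and Toeplitz covariance entries $\rho^{|i-j|}$ for the gaussian disturbances. The burn-in length of 1,000 follows the description given later in this section; all function names are hypothetical.

```python
import random
from math import sqrt

def toeplitz_cov(dim, rho):
    """Covariance matrix with (i, j)th entry rho^|i-j|."""
    return [[rho ** abs(i - j) for j in range(dim)] for i in range(dim)]

def cholesky(matrix):
    """Lower-triangular L with L L' = matrix (matrix must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower

def simulate_regressors(T, dim, a_x=0.5, rho=0.75, burn_in=1000, rng=None):
    """Return T observations of a dim-dimensional AR(1) process for one unit."""
    rng = rng or random.Random(0)
    chol = cholesky(toeplitz_cov(dim, rho))
    x = [0.0] * dim
    path = []
    for step in range(burn_in + T):
        shocks = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # correlated gaussian disturbance e = L * shocks
        e = [sum(chol[i][k] * shocks[k] for k in range(i + 1)) for i in range(dim)]
        x = [a_x * xi + ei for xi, ei in zip(x, e)]
        if step >= burn_in:  # discard the burn-in observations
            path.append(x)
    return path

path = simulate_regressors(T=10, dim=4)
print(len(path), "observations of dimension", len(path[0]))
```

Repeating the call with independent random number generators for i = 1,...,N gives regressors that are independent across units.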

We allow the fixed effect $\eta_i$ to depend on the initial observation of $x_{i,t}$:

\[
\eta_i = x_{i,0}' b_\eta/\sqrt{\log p_x}, \quad i = 1,\dots,N,
\]

where $b_\eta$ is a $p_x \times 1$ vector whose entries are drawn from standard normal and normalized to have
unit $\ell_1$-norm. Note that $|\eta_i| \le \|x_{i,0}/\sqrt{\log p_x}\|_\infty \|b_\eta\|_1 = \|x_{i,0}\|_\infty/\sqrt{\log p_x}$. If $x_{i,0}$ is multivariate
normal, then $\|x_{i,0}\|_\infty = O_p(\sqrt{\log p_x})$. In this sense, $\eta_i$ is bounded. However, $\eta$ is not necessarily
weakly sparse and thus we also investigate how robust our results are to violations of this
assumption. Of course our estimator performed much better in the truly weakly sparse setting
than the setting we present here (results available upon request).

As our theory allows for heteroskedasticity, we also investigate the effect of this. To be 
precise, we consider error terms of the form Ei^ = {xi^p/\/2 + hxXid^^^) where Uip is inde¬ 
pendent of yid-i, ...,yip-L and Xip, ...,Xip. bx is chosen such that the unconditional variance of 
Eid is the same as the one of Ui^ which in turns equals the one from the homoskedastic case. A 
simple calculation reveals that bx = [—V^p + \/2p^ + 2 — da^) /2. Note that Ei^ constructed 
this way satisfies Assumption [TJ The reason we ensure that the unconditional variance is the 
same as in the homoskedastic case is that we do not want any findings in the heteroskedastic
case to be driven by a plain change in the unconditional variance. 

Our estimator is compared to the least squares oracle which only includes variables with
non-zero coefficients on top of those variables we wish to test hypotheses about. Thus, it
is an oracle which knows the relevant control variables. When sample size allows it, that
is when p + N < NT, we also implement naive least squares including all variables. This
estimator is numerically equivalent to the often used within estimator. Finally, we implemented
the desparsified conservative Lasso of Caner and Kock (2014). However, this only improved
the results slightly and so we do not report these results here. The number of Monte Carlo
replications is 1,000 for all setups and we consider the performance of our estimator along the
following dimensions:


1. Estimation error: We compute the root mean square errors (RMSE) of all procedures 
averaged over the Monte Carlo replications. 

2. Coverage rate: We calculate the coverage rate of a gaussian confidence interval constructed
as in Theorem 3. This is done for three coefficients of regressors in $x_{i,t}$.

3. Length of confidence interval: We calculate the length of the three confidence intervals 
considered in point 2 above. 

4. Size: We evaluate the size of the $\chi^2$-test in Theorem 2 for a hypothesis involving the same
three parameters we construct confidence intervals for in point 2 above.

5. Power: We evaluate the power of the $\chi^2$-test in point 4 above.


All tests are carried out at the 5% level of significance and all confidence intervals have a nominal
coverage of 95%. Furthermore, as our results regarding estimation error are for the plain Lasso,
the root mean square errors are reported for this instead of the desparsified Lasso. As our
models are dynamic, we allow for a burn-in period of 1,000 observations when generating the
data.

The following experiments were carried out 

• Experiment 1 (moderate-dimensional setting): N = 20 and T = 10. $\beta$ is 100 x 1 with five
equidistant non-zero entries equaling one. Thus, p = 105 and $s_1 = 7$. In total, $\gamma = (\alpha', \eta')'$
is 125 x 1. The disturbances of $x_{i,t}$, $e_{distur,i,t}$, are gaussian and $\varepsilon_{i,t}$ are standard gaussian.
We test the true hypothesis

$H_0 : (\gamma_7, \gamma_{27}, \gamma_{47}) = (0, 0, 0)$

by the $\chi^2_3$ test described in Theorem 2 in order to gauge the size of the test. The power
is investigated by the hypothesis

$H_0 : (\gamma_7, \gamma_{27}, \gamma_{47}) = (0.4, 0, 0)$.

The following variations of this setting are considered

(a) The baseline case described so far.

(b) Same as (a) but with heteroskedastic errors.

(c) Same as (b) but $e_{distur,i,t}$ and $\varepsilon_{i,t}$ are t-distributed with 3 degrees of freedom. In this
case, even $\eta_i$ may not be $O_p(1)$.
• Experiment 2 (high-dimensional setting): N = 20 and T = 10. $\beta$ is 400 x 1 with five
equidistant non-zero entries equaling one. Thus, p = 405 and $s_1 = 7$. In total, $\gamma = (\alpha', \eta')'$
is 425 x 1. The disturbances of $x_{i,t}$, $e_{distur,i,t}$, are gaussian and $\varepsilon_{i,t}$ are standard gaussian.
We test the true hypothesis

$H_0 : (\gamma_7, \gamma_{87}, \gamma_{167}) = (0, 0, 0)$

by the $\chi^2_3$ test described in Theorem 2 in order to gauge the size of the test. The power
is investigated by the hypothesis

$H_0 : (\gamma_7, \gamma_{87}, \gamma_{167}) = (0.4, 0, 0)$.

The following variations of this setting are considered

(a) The baseline case described so far.

(b) Same as (a) but with heteroskedastic errors.

(c) Same as (b) but $e_{distur,i,t}$ and $\varepsilon_{i,t}$ are t-distributed with 3 degrees of freedom. In this
case, even $\eta_i$ may not be $O_p(1)$.

• Experiment 3: (increase T): As Experiment 2 but with T = 40. 

• Experiment 4: (increase N): As Experiment 2 but with N = 40. 

• Experiment 5 (high-dimensional setting 2): N = 20 and T = 40. $\beta$ is 1005 x 1 with
15 equidistant non-zero entries equaling one. Thus, p = 1010 and $s_1 = 17$. In total,
$\gamma = (\alpha', \eta')'$ is 1030 x 1. The disturbances of $x_{i,t}$, $e_{distur,i,t}$, are gaussian and $\varepsilon_{i,t}$ are
standard gaussian. We test the true hypothesis

$H_0 : (\gamma_7, \gamma_{74}, \gamma_{141}) = (0, 0, 0)$

by the $\chi^2_3$ test described in Theorem 2 in order to gauge the size of the test. The power
is investigated by the hypothesis

$H_0 : (\gamma_7, \gamma_{74}, \gamma_{141}) = (0.4, 0, 0)$.

The following variations of this setting are considered

(a) The baseline case described so far.

(b) Same as (a) but with heteroskedastic errors.

(c) Same as (b) but $e_{distur,i,t}$ and $\varepsilon_{i,t}$ are t-distributed with 3 degrees of freedom. In this
case, even $\eta_i$ may not be $O_p(1)$.

Table 1 contains the results of Experiment 1. Setting 1(a) reveals that the RMSE of the
Lasso are lower than those for least squares including all variables but higher than those of least
squares only including the relevant variables. This is the case for $\alpha$ as well as the fixed effects.
Next, it is very encouraging that the coverage probabilities for the desparsified Lasso are close
to the ones based on the oracle. The lengths of the confidence intervals are also comparable for
those two procedures while the ones based on the within estimator are considerably wider while
still having a lower coverage. The oracle and the desparsified Lasso both produce tests which
are a bit oversized but they are still much better than the within estimator. The same is true
when it comes to power.

Experiment 1(b) adds heteroskedasticity to the error terms and none of the procedures is
affected by this.

In Panel 1(c) the random variables have heavy tails. Overall, and as expected, all procedures
suffer from this. However, it is worth mentioning that the coverage rate of the confidence
                  RMSE                 Coverage                    Length
              α         η        γ_7     γ_27    γ_47     γ_7     γ_27    γ_47     Size    Power
1(a)  LS    23.744    16.382    0.762   0.786   0.766    0.552   0.549   0.553    0.412   0.741
      DL     3.061     8.528    0.892   0.918   0.884    0.395   0.396   0.396    0.150   0.852
      Ora    0.523     7.226    0.933   0.939   0.919    0.402   0.403   0.404    0.074   0.907
1(b)  LS    23.796    16.444    0.732   0.760   0.747    0.548   0.543   0.547    0.453   0.747
      DL     3.011     8.298    0.920   0.914   0.903    0.408   0.389   0.391    0.135   0.846
      Ora    0.524     7.079    0.920   0.931   0.937    0.401   0.393   0.398    0.092   0.904
1(c)  LS    46.140    51.979    0.753   0.772   0.743    0.910   0.883   0.889    0.432   0.605
      DL     4.747    23.939    0.912   0.901   0.883    0.619   0.544   0.567    0.159   0.632
      Ora    1.200    23.596    0.907   0.937   0.913    0.662   0.617   0.617    0.091   0.651

Table 1: Experiment 1. LS, DL and Ora: least squares including all variables, desparsified Lasso and
least squares oracle. RMSE: root mean square error. Coverage: the coverage rate of the asymptotic
95% confidence intervals. Length: the average length of the asymptotic 95% confidence intervals. Size:
size of the correct hypothesis H_0 : (γ_7, γ_27, γ_47) = (0, 0, 0). Power: the probability to reject the false
H_0 : (γ_7, γ_27, γ_47) = (0.4, 0, 0).


intervals does not decrease. Instead, the length of these intervals increases to reflect the larger 
uncertainty. The size of the significance test is not affected either while the power suffers. 

Next, we turn to Experiment 2(a) which is high-dimensional. The results can be found in
Table 2. As expected, the estimation error is higher for the Lasso than for the oracle. However, it
is encouraging that the confidence intervals produced by the desparsified Lasso have a coverage
which is almost as good as the one for the oracle. In fact, the coverage rate is close to
identical to the one in the above moderate-dimensional simulation. The length of the confidence
bands based on the desparsified Lasso is actually slightly lower than the ones based on the oracle
which explains their slightly worse performance. The significance test is again a bit oversized
for the oracle as well as the desparsified Lasso but the size is not far from the one in Table 1.
Power is also virtually unaffected by the increase in dimension.

Experiment 2(b) adds heteroskedasticity and the results are not affected by this. Finally,
the addition of heavy tails in Experiment 2(c) makes the estimators less precise. However, and
as in the moderate-dimensional setting above, the coverage remains high since the confidence
bands get wider. The size of the significance test is unaffected while the power goes down for
the oracle as well as the desparsified Lasso.

In Table 3, T has been increased to 40 compared to Table 2. This results in lower estimation
errors for the Lasso as well as oracle assisted least squares. The coverage rates of the confidence
bands also improve and get closer to the nominal rate. At the same time, the bands also get
narrower. The size of the significance test also improves and the power of the oracle and the
desparsified Lasso is 1. As above, adding heteroskedasticity does not alter the results. The
consequences of heavy tails are also the same: higher estimation error, no change in coverage of
confidence bands, wider bands, unchanged size, but lower power.

Table 4 increases N to 40 compared to Table 2. This results in more fixed effects to be
estimated. Thus, it is not surprising that the estimation error for $\alpha$ goes down while the one for
$\eta$ increases. The coverage rates of the oracle as well as the desparsified Lasso improve compared
to Table 2. However, the length of the confidence bands does not decrease as much as when
T was increased in the previous experiment. The size of the significance test decreases when
increasing N while power is close to 1. Adding heteroskedasticity has no consequences while
the presence of heavy tails has the usual effect.

Table 5 contains a setting with more than 1000 variables. The main message of the previous
tables prevails even in this setting: the coverage of the Lasso-based confidence intervals is
                  RMSE                 Coverage                    Length
              α         η        γ_7     γ_87    γ_167    γ_7     γ_87    γ_167    Size    Power
2(a)  LS      -         -         -       -       -        -       -       -        -       -
      DL     4.209     8.333    0.875   0.893   0.881    0.386   0.385   0.386    0.189   0.841
      Ora    0.513     7.103    0.919   0.918   0.924    0.402   0.403   0.403    0.110   0.922
2(b)  LS      -         -         -       -       -        -       -       -        -       -
      DL     4.165     8.322    0.896   0.872   0.861    0.407   0.379   0.381    0.189   0.825
      Ora    0.535     7.074    0.906   0.913   0.929    0.401   0.396   0.397    0.101   0.899
2(c)  LS      -         -         -       -       -        -       -       -        -       -
      DL     7.602    22.895    0.916   0.868   0.882    0.622   0.543   0.551    0.193   0.602
      Ora    1.074    21.724    0.922   0.944   0.944    0.657   0.619   0.632    0.076   0.674

Table 2: Experiment 2. LS, DL and Ora: least squares including all variables, desparsified Lasso and
least squares oracle. RMSE: root mean square error. Coverage: the coverage rate of the asymptotic
95% confidence intervals. Length: the average length of the asymptotic 95% confidence intervals. Size:
size of the correct hypothesis H_0 : (γ_7, γ_87, γ_167) = (0, 0, 0). Power: the probability to reject the false
H_0 : (γ_7, γ_87, γ_167) = (0.4, 0, 0).


almost as good as the ones based on the oracle. On the other hand, the bands of the former are 
now slightly wider than the ones of the latter. Both procedures have power close to one while 
the Lasso-based test is a bit oversized compared to the oracle based test. Heteroskedasticity 
does not affect the results. The consequences of heavy tails are the same as in the previous 
experiments. 
                  RMSE              Coverage                Length
Panel Est.        α        η      γ_7   γ_87  γ_167      γ_7   γ_87  γ_167    Size  Power
3(a)  LS      40.341    7.367    0.815  0.827  0.796    0.266  0.266  0.268   0.325  0.993
      DL       1.208    2.121    0.931  0.918  0.925    0.190  0.190  0.190   0.098  1.000
      Ora      0.223    3.208    0.943  0.945  0.933    0.186  0.187  0.187   0.052  1.000
3(b)  LS      40.351    7.376    0.800  0.823  0.813    0.267  0.265  0.267   0.318  0.993
      DL       1.169    2.139    0.923  0.915  0.924    0.200  0.189  0.189   0.104  1.000
      Ora      0.233    3.235    0.955  0.940  0.953    0.189  0.185  0.186   0.060  1.000
3(c)  LS      88.649   29.738    0.766  0.813  0.837    0.460  0.452  0.448   0.315  0.821
      DL       2.630    7.689    0.931  0.922  0.913    0.359  0.309  0.319   0.090  0.926
      Ora      0.610   10.759    0.930  0.963  0.951    0.329  0.299  0.302   0.056  0.962


Table 3: Experiment 3. LS, DL and Ora: least squares including all variables, desparsified Lasso and
least squares oracle. RMSE: root mean square error. Coverage: the coverage rate of the asymptotic
95% confidence intervals. Length: the average length of the asymptotic 95% confidence intervals. Size:
size of the test of the correct hypothesis $H_0: (\gamma_7, \gamma_{87}, \gamma_{167}) = (0, 0, 0)$. Power: the probability of rejecting the false
$H_0: (\gamma_7, \gamma_{87}, \gamma_{167}) = (0.4, 0, 0)$.




                  RMSE              Coverage                Length
Panel Est.        α        η      γ_7   γ_87  γ_167      γ_7   γ_87  γ_167    Size  Power
4(a)  LS
      DL       2.145   14.002    0.924  0.891  0.920    0.261  0.262  0.263   0.123  0.988
      Ora      0.351   13.570    0.936  0.934  0.930    0.282  0.283  0.285   0.065  0.999
4(b)  LS
      DL       2.145   14.073    0.933  0.904  0.911    0.275  0.260  0.263   0.114  0.979
      Ora      0.368   13.469    0.926  0.931  0.940    0.283  0.281  0.282   0.076  1.000
4(c)  LS
      DL       4.305   42.303    0.917  0.899  0.903    0.456  0.386  0.390   0.139  0.825
      Ora      0.782   41.269    0.926  0.934  0.926    0.465  0.428  0.435   0.087  0.870


Table 4: Experiment 4. LS, DL and Ora: least squares including all variables, desparsified Lasso and
least squares oracle. RMSE: root mean square error. Coverage: the coverage rate of the asymptotic
95% confidence intervals. Length: the average length of the asymptotic 95% confidence intervals. Size:
size of the test of the correct hypothesis $H_0: (\gamma_7, \gamma_{87}, \gamma_{167}) = (0, 0, 0)$. Power: the probability of rejecting the false
$H_0: (\gamma_7, \gamma_{87}, \gamma_{167}) = (0.4, 0, 0)$.
                  RMSE              Coverage                Length
Panel Est.        α        η      γ_7   γ_74  γ_141      γ_7   γ_74  γ_141    Size  Power
5(a)  LS
      DL       3.342    2.463    0.922  0.903  0.912    0.209  0.209  0.208   0.124  0.997
      Ora      0.546    3.305    0.956  0.942  0.949    0.187  0.187  0.187   0.062  1.000
5(b)  LS
      DL       3.327    2.432    0.913  0.916  0.896    0.218  0.207  0.208   0.127  0.998
      Ora      0.556    3.261    0.942  0.951  0.921    0.190  0.186  0.186   0.065  1.000
5(c)  LS
      DL       7.294    7.273    0.936  0.910  0.916    0.363  0.307  0.304   0.101  0.920
      Ora      1.072    9.693    0.952  0.936  0.951    0.326  0.297  0.293   0.061  0.955


Table 5: Experiment 5. LS, DL and Ora: least squares including all variables, desparsified Lasso and
least squares oracle. RMSE: root mean square error. Coverage: the coverage rate of the asymptotic
95% confidence intervals. Length: the average length of the asymptotic 95% confidence intervals. Size:
size of the test of the correct hypothesis $H_0: (\gamma_7, \gamma_{74}, \gamma_{141}) = (0, 0, 0)$. Power: the probability of rejecting the false
$H_0: (\gamma_7, \gamma_{74}, \gamma_{141}) = (0.4, 0, 0)$.
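
The summary statistics reported in the tables can be computed from Monte Carlo output along the following lines. This is only a minimal sketch: the function names, the normalisation of the RMSE (root of the average squared Euclidean error over replications) and the use of p-values to compute rejection frequencies are assumptions made for illustration, not a description of the code behind the tables.

```python
import math

def rmse(estimates, truth):
    """Root of the average squared Euclidean error over replications.

    estimates: list of coefficient vectors, one per replication (assumed layout).
    truth: the true coefficient vector.
    """
    total = 0.0
    for est in estimates:
        total += sum((e - t) ** 2 for e, t in zip(est, truth))
    return math.sqrt(total / len(estimates))

def coverage_and_length(estimates, std_errors, truth, z=1.959963984540054):
    """Coverage rate and average length of the intervals estimate +/- z * se."""
    hits = 0
    total_length = 0.0
    for est, se in zip(estimates, std_errors):
        lower, upper = est - z * se, est + z * se
        if lower <= truth <= upper:
            hits += 1
        total_length += upper - lower
    n = len(estimates)
    return hits / n, total_length / n

def rejection_rate(p_values, level=0.05):
    """Share of replications rejecting at the given level.

    This is the size under a true null hypothesis and the power under a false one.
    """
    return sum(p < level for p in p_values) / len(p_values)
```

Here `coverage_and_length` is applied to one scalar parameter at a time, while `rmse` takes whole coefficient vectors.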
6 Conclusion 


This paper has considered inference in high-dimensional dynamic panel data models with fixed 
effects. In particular we have shown how hypotheses involving an increasing number of variables 
can be tested. These hypotheses can involve parameters from all groups of variables in the 
model simultaneously. As a stepping stone towards this inference we constructed a uniformly 
valid estimator of the covariance matrix of the parameter estimates which is robust towards 
conditional heteroskedasticity. We also stress that our theory does not require the inverse 
covariance matrix of the covariates to be exactly sparse. 

Next, we showed that confidence bands based on our procedure are asymptotically honest 
and contract at the optimal rate. This rate of contraction depends on which type of parameter 
is under consideration. Simulations revealed that our procedure works well in finite samples. 
Future work may include relaxing the sparsity assumption on the inverse covariance matrix $\Theta_Z$ 
as well as extending our results to non-linear panel data models. 


7 Appendix A 


7.1 Sufficient Conditions for $y_{i,t}$ to be Subgaussian 


The following Lemma provides sufficient conditions for $y_{i,t}$ to inherit the subgaussianity from 
the covariates and the error terms. It allows for a wide range of models but rules out dynamic 
panel data models which are explosive or contain unit roots. 


Lemma 2. Let $x_{i,t,k}$ and $\varepsilon_{i,t}$ be uniformly subgaussian for $i = 1,\dots,N$, $t = 1,\dots,T$ and 
$k = 1,\dots,p_x$ and assume that $\|\beta\|_{\ell_1} \le C$ for some $C > 0$ for all $N$ and $T$. Furthermore, 
$\max_{1\le i\le N}|\eta_i|$ is bounded uniformly in $N$ and $T$. Then, if all roots of 
$1 - \sum_{j=1}^{L}\alpha_j z^j$ ($\alpha_1,\dots,\alpha_L$ fixed) are outside the unit disc, $y_{i,t}$ is uniformly 
subgaussian for $i = 1,\dots,N$ and $t = 1,\dots,T$. 


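Proof of Lemma 2. Let $y_t = \sum_{j=1}^{L}\alpha_j y_{t-j} + u_t$ be an AR($L$) process with roots outside the 
unit disc. Write the companion form as $\xi_t = F\xi_{t-1} + v_t$. Then, by the monotone convergence 
theorem for Orlicz norms, see van der Vaart and Wellner (1996), exercise 6, page 105, 

\[
\big\|\|\xi_t\|\big\|_{\psi_2} \le \sum_{j=0}^{\infty}\|F^j\|\,\big\|\|v_{t-j}\|\big\|_{\psi_2} = \sum_{j=0}^{\infty}\|F^j\|\,\|u_{t-j}\|_{\psi_2},
\]

where $\|\cdot\|$ is the $\ell_2$ induced norm, and the last equality used that $v_t$ is $L \times 1$ with only one 
non-zero entry equaling $u_t$. By Corollary 5.6.14 in Horn and Johnson (1990) there exists a 
$1 > \delta > 0$ such that $\|F^j\| \le (1-\delta)^j$ for $j$ sufficiently large. Thus, if $\|u_t\|_{\psi_2}$ is uniformly 
bounded we conclude $\|y_t\|_{\psi_2} \le \big\|\|\xi_t\|\big\|_{\psi_2} \le K$ for some $K > 0$. Thus, in our context it 
suffices to show that $\|x_{i,t}'\beta + \eta_i + \varepsilon_{i,t}\|_{\psi_2}$ is uniformly bounded as 
$y_{i,t} = \sum_{j=1}^{L}\alpha_j y_{i,t-j} + x_{i,t}'\beta + \eta_i + \varepsilon_{i,t} = \sum_{j=1}^{L}\alpha_j y_{i,t-j} + u_{i,t}$ with 
$u_{i,t} = x_{i,t}'\beta + \eta_i + \varepsilon_{i,t}$. But 

\[
\|x_{i,t}'\beta + \eta_i + \varepsilon_{i,t}\|_{\psi_2} \le \sum_{k=1}^{p_x}|\beta_k|\,\|x_{i,t,k}\|_{\psi_2} + \|\eta_i\|_{\psi_2} + \|\varepsilon_{i,t}\|_{\psi_2},
\]

which is bounded by the assumptions made. 

□ 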
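
The geometric decay of $\|F^j\|$ used in the proof can be illustrated numerically. The sketch below builds the companion matrix of an AR($L$) recursion and tracks the Frobenius norm of its powers; since the Frobenius norm dominates the $\ell_2$ induced norm, decay of the former implies decay of the latter. The function names and the example coefficients are hypothetical and chosen only for illustration.

```python
import math

def companion(alphas):
    """Companion matrix F of y_t = sum_j alphas[j-1] * y_{t-j} + u_t."""
    L = len(alphas)
    F = [[0.0] * L for _ in range(L)]
    F[0] = [float(a) for a in alphas]   # first row holds the lag coefficients
    for row in range(1, L):
        F[row][row - 1] = 1.0           # sub-diagonal shifts the state vector
    return F

def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def frobenius(A):
    """Frobenius norm, an upper bound on the l2 induced norm."""
    return math.sqrt(sum(x * x for row in A for x in row))

def power_norms(alphas, max_power):
    """Frobenius norms of F^1, ..., F^max_power."""
    F = companion(alphas)
    P = F
    norms = [frobenius(P)]
    for _ in range(max_power - 1):
        P = matmul(P, F)
        norms.append(frobenius(P))
    return norms

# Stationary example: the norms shrink geometrically.
stable_norms = power_norms([0.5, 0.3], 60)
# Unit-root example (0.6 + 0.4 = 1): the norms do not vanish.
unit_root_norms = power_norms([0.6, 0.4], 60)
```

In the unit-root case the powers of $F$ converge to a non-zero matrix, which is why such models are excluded by the root condition of the lemma.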


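7.2 Proof of Theorem 1 

Before we proceed to the proof of Theorem 1 we introduce $\eta_i^* = \eta_i 1\{|\eta_i| > H\}$ for some $H > 0$. 
We shall choose $H$ below. Next, $J_2 = \{i : \eta_i^* \neq 0,\ i = 1,\dots,N\}$ and $s_2 = |J_2|$. Introduce the 
events 

\[
A_N = \Big\{\|Z'\varepsilon\|_\infty \le \frac{\lambda_N}{2},\ \|D'\varepsilon\|_\infty \le \frac{\lambda_N}{2\sqrt{N}}\Big\}, \qquad B_N = \Big\{\kappa^2(\Psi_N, s_1, s_2) \ge \frac{\kappa^2}{2}\Big\}.
\]

Lemma 3. On the event $A_N$, the following inequalities are valid 

\[
\|\Pi(\hat\gamma-\gamma)\|^2 + \lambda_N\|\hat\alpha-\alpha\|_1 + \frac{\lambda_N}{\sqrt{N}}\|\hat\eta-\eta\|_1 \le 4\lambda_N\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 + 4\frac{\lambda_N}{\sqrt{N}}\|\hat\eta_{J_2}-\eta_{J_2}\|_1 + 4\frac{\lambda_N}{\sqrt{N}}EH^{1-\nu}; \tag{7.1}
\]

\[
\|\hat\alpha_{J_1^c}-\alpha_{J_1^c}\|_1 + \frac{1}{\sqrt{N}}\|\hat\eta_{J_2^c}-\eta_{J_2^c}\|_1 \le 3\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 + 3\frac{1}{\sqrt{N}}\|\hat\eta_{J_2}-\eta_{J_2}\|_1 + 4\frac{1}{\sqrt{N}}EH^{1-\nu}. \tag{7.2}
\]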

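Proof. By the minimizing property of the Lasso, 

\[
\|y-\Pi\hat\gamma\|^2 + 2\lambda_N\|\hat\alpha\|_1 + 2\frac{\lambda_N}{\sqrt{N}}\|\hat\eta\|_1 \le \|y-\Pi\gamma\|^2 + 2\lambda_N\|\alpha\|_1 + 2\frac{\lambda_N}{\sqrt{N}}\|\eta\|_1
\]

such that inserting $y = \Pi\gamma + \varepsilon$ yields 

\[
\|\Pi(\hat\gamma-\gamma)\|^2 \le 2\varepsilon'\Pi(\hat\gamma-\gamma) + 2\lambda_N\big(\|\alpha\|_1 - \|\hat\alpha\|_1\big) + 2\frac{\lambda_N}{\sqrt{N}}\big(\|\eta\|_1 - \|\hat\eta\|_1\big). \tag{7.3}
\]

Note that on $A_N$ 

\[
2\varepsilon'\Pi(\hat\gamma-\gamma) \le 2\|\varepsilon'Z\|_\infty\|\hat\alpha-\alpha\|_1 + 2\|\varepsilon'D\|_\infty\|\hat\eta-\eta\|_1 \le \lambda_N\|\hat\alpha-\alpha\|_1 + \frac{\lambda_N}{\sqrt{N}}\|\hat\eta-\eta\|_1.
\]

Using this and adding $\lambda_N\|\hat\alpha-\alpha\|_1 + \frac{\lambda_N}{\sqrt{N}}\|\hat\eta-\eta\|_1$ to both sides of (7.3) gives 

\[
\begin{aligned}
\|\Pi(\hat\gamma-\gamma)\|^2 &+ \lambda_N\|\hat\alpha-\alpha\|_1 + \frac{\lambda_N}{\sqrt{N}}\|\hat\eta-\eta\|_1 \\
&\le 2\lambda_N\big(\|\alpha\|_1 - \|\hat\alpha\|_1 + \|\hat\alpha-\alpha\|_1\big) + 2\frac{\lambda_N}{\sqrt{N}}\big(\|\eta\|_1 - \|\hat\eta\|_1 + \|\hat\eta-\eta\|_1\big) \\
&\le 2\lambda_N\big(\|\alpha_{J_1}\|_1 - \|\hat\alpha_{J_1}\|_1 + \|\hat\alpha_{J_1}-\alpha_{J_1}\|_1\big) + 2\frac{\lambda_N}{\sqrt{N}}\big(\|\eta_{J_2}\|_1 - \|\hat\eta_{J_2}\|_1 + \|\hat\eta_{J_2}-\eta_{J_2}\|_1 + 2EH^{1-\nu}\big) \\
&\le 4\lambda_N\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 + 4\frac{\lambda_N}{\sqrt{N}}\|\hat\eta_{J_2}-\eta_{J_2}\|_1 + 4\frac{\lambda_N}{\sqrt{N}}EH^{1-\nu},
\end{aligned}
\]

where the second inequality is due to 

\[
\|\eta_{J_2^c}\|_1 - \|\hat\eta_{J_2^c}\|_1 + \|\hat\eta_{J_2^c}-\eta_{J_2^c}\|_1 \le 2\|\eta_{J_2^c}\|_1 = 2\sum_{i=1}^{N}|\eta_i|1\{|\eta_i| \le H\} \le 2H^{1-\nu}\sum_{i=1}^{N}|\eta_i|^{\nu}1\{|\eta_i| \le H\} \le 2EH^{1-\nu}.
\]

We proved (7.1). (7.2) follows trivially from this. 


□ 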
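
For concreteness, the criterion whose minimizing property opens the proof, $\|y-\Pi\gamma\|^2 + 2\lambda_N\|\alpha\|_1 + 2\frac{\lambda_N}{\sqrt{N}}\|\eta\|_1$ with $\Pi = (Z, D)$, can be evaluated as in the following minimal sketch. The function name and the list-of-rows data layout are assumptions made for illustration; the rows are taken to be ordered $(1,1),\dots,(1,T),(2,1),\dots,(N,T)$, and $D$ is treated as the matrix of unit dummies, so that $D\eta$ simply adds $\eta_i$ to every observation of unit $i$.

```python
import math

def panel_lasso_objective(y, Z, N, T, alpha, eta, lam):
    """Evaluate ||y - Z alpha - D eta||^2 + 2 lam ||alpha||_1 + 2 (lam / sqrt(N)) ||eta||_1.

    y: list of N*T responses; Z: list of N*T rows of length p.
    Rows are ordered unit by unit, so row // T is the unit index.
    """
    assert len(y) == N * T and len(Z) == N * T and len(eta) == N
    rss = 0.0
    for row in range(N * T):
        unit = row // T                                  # which fixed effect applies
        fitted = sum(z * a for z, a in zip(Z[row], alpha)) + eta[unit]
        rss += (y[row] - fitted) ** 2
    penalty_alpha = 2.0 * lam * sum(abs(a) for a in alpha)
    # The fixed effects are penalised at the smaller rate lam / sqrt(N).
    penalty_eta = 2.0 * lam / math.sqrt(N) * sum(abs(e) for e in eta)
    return rss + penalty_alpha + penalty_eta

# Tiny example with N = 2 units, T = 2 periods and one regressor.
Z_demo = [[1.0], [2.0], [3.0], [4.0]]
y_demo = [1.5, 2.5, 2.5, 3.5]
value_at_truth = panel_lasso_objective(y_demo, Z_demo, 2, 2, [1.0], [0.5, -0.5], 1.0)
value_at_zero = panel_lasso_objective(y_demo, Z_demo, 2, 2, [0.0], [0.0, 0.0], 1.0)
```

In the example the first parameter value fits the data exactly, so only the two penalty terms contribute to `value_at_truth`.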


Lemma 4 (Deterministic oracle inequalities). Let Assumption\^hold. On the event An O 
Bn one has for any positive constant Xn, 


/A \ 

e' 


1 -u 


120 AArSi /120 


4nt + “ 


y/N 


\VntJ 


120AArsi /120 \ TJ, f Xn \ 

- - 47W¥ ^(- 4 ^ 7 ^ [7W¥) 


1-u 


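Moreover, the above bounds are valid uniformly over 
$\mathcal{F}(s_1,\nu,E) := \{\alpha \in \mathbb{R}^p : \|\alpha\|_{\ell_0} \le s_1\} \times \{\eta \in \mathbb{R}^N : \sum_{i=1}^{N}|\eta_i|^{\nu} \le E\}$. 

Proof. By (7.1) of Lemma 3, which is valid on $A_N$, 

\[
\|\Pi(\hat\gamma-\gamma)\|^2 \le 4\lambda_N\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 + 4\frac{\lambda_N}{\sqrt{N}}\|\hat\eta_{J_2}-\eta_{J_2}\|_1 + 4\frac{\lambda_N}{\sqrt{N}}EH^{1-\nu}. \tag{7.4}
\]

Consider the auxiliary event 

\[
C_N := \Big\{\frac{4}{\sqrt{N}}EH^{1-\nu} \le \|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 + \frac{1}{\sqrt{N}}\|\hat\eta_{J_2}-\eta_{J_2}\|_1\Big\}.
\]

On the event $A_N \cap C_N$, from (7.2) of Lemma 3 we have 

\[
\|\hat\alpha_{J_1^c}-\alpha_{J_1^c}\|_1 + \frac{1}{\sqrt{N}}\|\hat\eta_{J_2^c}-\eta_{J_2^c}\|_1 \le 4\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 + 4\frac{1}{\sqrt{N}}\|\hat\eta_{J_2}-\eta_{J_2}\|_1. \tag{7.5}
\]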


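In order to apply the compatibility condition, re-parametrise the vector $\delta$ in the definition of 
the compatibility condition as follows. Let $b^1$ and $b^2$ be $p \times 1$ and $N \times 1$ vectors, respectively, 
with $b = (b^{1\prime}, b^{2\prime})'$ defined as 

\[
\begin{pmatrix} b^1 \\ b^2 \end{pmatrix} = \begin{pmatrix} I_p & 0 \\ 0 & \sqrt{N}I_N \end{pmatrix}\begin{pmatrix} \delta^1 \\ \delta^2 \end{pmatrix}.
\]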


Hence, that pi, r 2 ) is bounded away from zero for integers ri € p} and r 2 G 

{1,..., A^} is equivalent to 


K^{^N,n,r 2 ) 


min 

iJ2C{l,...,Af},|iJ2|<r2 

R:=RiUR 2 


min 

bemp+^\{o}, 


NT 

ri+r 2 


I|n6f 



(7.6) 


By (17.511 . our estimator satisfies the constraint of the just introduced version of the compatibility 
condition and so 




> 


> 


Si + S2 

K^{^N,Si,S2 )NT 
Si + S2 

kINT 

2(si + S 2 ) 


dji - aji \ 

(f/J2-^J2)/v^ ) 


dji - ajilli + j^\\flJ2 - ^J2\\t) 


djl - ajlIll + jyll^ - ^J2lll) > 


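where the last inequality is valid on $B_N$. Hence, on $A_N \cap B_N \cap C_N$, upon combining with (7.4) 
one has 

\[
\begin{aligned}
\frac{\kappa^2 NT}{2(s_1+s_2)}\Big(\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1^2 + \frac{1}{N}\|\hat\eta_{J_2}-\eta_{J_2}\|_1^2\Big) &\le 4\lambda_N\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 + \frac{4\lambda_N}{\sqrt{N}}\|\hat\eta_{J_2}-\eta_{J_2}\|_1 + \frac{4\lambda_N}{\sqrt{N}}EH^{1-\nu} \\
&\le 5\lambda_N\Big(\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 + \frac{1}{\sqrt{N}}\|\hat\eta_{J_2}-\eta_{J_2}\|_1\Big),
\end{aligned}
\]

which, since $\kappa^2 > 0$ by Assumption 2, is equivalent to 

\[
\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1^2 - \frac{10\lambda_N(s_1+s_2)}{\kappa^2 NT}\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 + \frac{1}{N}\|\hat\eta_{J_2}-\eta_{J_2}\|_1^2 - \frac{10\lambda_N(s_1+s_2)}{\kappa^2 N^{3/2}T}\|\hat\eta_{J_2}-\eta_{J_2}\|_1 \le 0.
\]

Let $x = \|\hat\alpha_{J_1}-\alpha_{J_1}\|_1$, $y = \|\hat\eta_{J_2}-\eta_{J_2}\|_1$, $a = \frac{10\lambda_N(s_1+s_2)}{\kappa^2 NT}$, $b = \frac{1}{N}$ and 
$c = \frac{10\lambda_N(s_1+s_2)}{\kappa^2 N^{3/2}T}$. Thus one has 

\[
x^2 - ax + by^2 - cy \le 0. \tag{7.7}
\]

First bound $x = \|\hat\alpha_{J_1}-\alpha_{J_1}\|_1$. For every $y$ the values of $x$ that satisfy the above quadratic 
inequality form an interval in $\mathbb{R}_+$. The right end point of this interval is the desired upper 
bound on $x$. Clearly, by the solution formula for the roots of a second degree polynomial, this 
right end point is a decreasing function in $by^2 - cy$. Hence, we first minimize the polynomial 
$by^2 - cy$ to find the largest possible value of $x$ which satisfies (7.7). This yields $y = c/(2b)$ and the 
corresponding value of $by^2 - cy$ is $-c^2/(4b)$. Hence, our desired upper bound on $x$ is the largest 
solution of $x^2 - ax - c^2/(4b) \le 0$. By the standard solution formula for the roots of a quadratic 
polynomial this yields 

\[
\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 = x \le \frac{a + \sqrt{a^2 + c^2/b}}{2} \le a + \frac{c}{2\sqrt{b}}. \tag{7.8}
\]

Switching the roles of $x$ and $y$, one gets a similar bound on $y = \|\hat\eta_{J_2}-\eta_{J_2}\|_1$, namely 

\[
\|\hat\eta_{J_2}-\eta_{J_2}\|_1 = y \le \frac{c + \sqrt{c^2 + ba^2}}{2b} \le \frac{c}{b} + \frac{a}{2\sqrt{b}}. \tag{7.9}
\]

Inserting the definitions of $a$, $b$ and $c$ into (7.8) and (7.9), we get 

\[
\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 \le \frac{15\lambda_N(s_1+s_2)}{\kappa^2 NT}, \tag{7.10}
\]

\[
\|\hat\eta_{J_2}-\eta_{J_2}\|_1 \le \frac{15\lambda_N(s_1+s_2)}{\kappa^2 N^{1/2}T}. \tag{7.11}
\]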
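
The step from the quadratic inequality to the bound on $x$ can be checked numerically. The following minimal sketch (the function name and the numbers plugged in are purely illustrative) evaluates the largest root of $x^2 - ax - c^2/(4b)$ and the simpler upper bound $a + c/(2\sqrt{b})$, and confirms that with $b = 1/N$ and $c = a/\sqrt{N}$ the latter equals $1.5a$, which is where the constant 15 comes from.

```python
import math

def largest_root_and_bound(a, b, c):
    """Largest x with x^2 - a*x - c^2/(4*b) <= 0, and the bound a + c/(2*sqrt(b))."""
    exact = (a + math.sqrt(a * a + c * c / b)) / 2.0
    simple = a + c / (2.0 * math.sqrt(b))
    return exact, simple

# Illustrative values: N = 100, a = 2, so that b = 1/N and c = a / sqrt(N).
N_demo = 100
a_demo = 2.0
b_demo = 1.0 / N_demo
c_demo = a_demo / math.sqrt(N_demo)
exact_root, simple_bound = largest_root_and_bound(a_demo, b_demo, c_demo)
# Value of the polynomial at the exact root (zero up to rounding error).
residual = exact_root ** 2 - a_demo * exact_root - c_demo ** 2 / (4.0 * b_demo)
```

The simple bound is looser than the exact root but has the advantage of being linear in $a$ and $c$.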

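Therefore, on $A_N \cap B_N \cap C_N$, it follows from (7.1) that 

\[
\begin{aligned}
\|\Pi(\hat\gamma-\gamma)\|^2 &\le 4\lambda_N\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 + \frac{4\lambda_N}{\sqrt{N}}\|\hat\eta_{J_2}-\eta_{J_2}\|_1 + \frac{4\lambda_N}{\sqrt{N}}EH^{1-\nu} \le \frac{120\lambda_N^2(s_1+s_2)}{\kappa^2 NT} + \frac{4\lambda_N}{\sqrt{N}}EH^{1-\nu}, \\
\|\hat\alpha-\alpha\|_1 &\le 4\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 + \frac{4}{\sqrt{N}}\|\hat\eta_{J_2}-\eta_{J_2}\|_1 + \frac{4}{\sqrt{N}}EH^{1-\nu} \le \frac{120\lambda_N(s_1+s_2)}{\kappa^2 NT} + \frac{4}{\sqrt{N}}EH^{1-\nu}, \\
\|\hat\eta-\eta\|_1 &\le 4\sqrt{N}\|\hat\alpha_{J_1}-\alpha_{J_1}\|_1 + 4\|\hat\eta_{J_2}-\eta_{J_2}\|_1 + 4EH^{1-\nu} \le \frac{120\lambda_N(s_1+s_2)}{\kappa^2\sqrt{N}T} + 4EH^{1-\nu}.
\end{aligned}
\]

On $A_N \cap C_N^c$ one has trivial oracle inequalities via (7.1) of Lemma 3. To be precise, 

\[
\|\Pi(\hat\gamma-\gamma)\|^2 \le \frac{20\lambda_N EH^{1-\nu}}{\sqrt{N}}, \qquad \|\hat\alpha-\alpha\|_1 \le \frac{20EH^{1-\nu}}{\sqrt{N}}, \qquad \|\hat\eta-\eta\|_1 \le 20EH^{1-\nu}.
\]

These inequalities are valid on the event $A_N \cap B_N \cap C_N^c$ too. Synchronising constants, using that 
$(A_N \cap B_N \cap C_N^c) \cup (A_N \cap B_N \cap C_N) = A_N \cap B_N$, and recognising that 

\[
s_2 := \sum_{i=1}^{N}1\{|\eta_i| > H\} = \sum_{i=1}^{N}1\{|\eta_i|^{\nu} > H^{\nu}\} \le EH^{-\nu},
\]

we arrive at 

\[
\begin{aligned}
\|\Pi(\hat\gamma-\gamma)\|^2 &\le \frac{120\lambda_N^2(s_1+EH^{-\nu})}{\kappa^2 NT} + \frac{20\lambda_N EH^{1-\nu}}{\sqrt{N}}, \\
\|\hat\alpha-\alpha\|_1 &\le \frac{120\lambda_N(s_1+EH^{-\nu})}{\kappa^2 NT} + \frac{20EH^{1-\nu}}{\sqrt{N}}, \\
\|\hat\eta-\eta\|_1 &\le \frac{120\lambda_N(s_1+EH^{-\nu})}{\kappa^2\sqrt{N}T} + 20EH^{1-\nu}.
\end{aligned}
\]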

The deterministic oracle inequalities follow upon choosing H = 

To see the uniformity over $\mathcal{F}(s_1,\nu,E)$, note that only the properties $s_1$, $\nu$ and $E$ characterizing $\alpha$ 
and $\eta$ enter the deterministic oracle inequalities. Hence, the deterministic oracle inequalities 
are uniform over the set $\mathcal{F}(s_1,\nu,E)$. □ 
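
The two elementary weak-sparsity bounds used in the proofs of Lemmas 3 and 4, namely $s_2 = \sum_i 1\{|\eta_i| > H\} \le EH^{-\nu}$ and $\sum_i |\eta_i|1\{|\eta_i| \le H\} \le EH^{1-\nu}$, can be verified on a toy vector. This is a minimal sketch for $0 < \nu < 1$; the function name, the example vector and the choice of $E$ as exactly $\sum_i|\eta_i|^{\nu}$ are assumptions made for illustration.

```python
def threshold_summary(eta, H, nu):
    """Return (s2, small_mass, E) for a vector of fixed effects.

    s2: number of entries with |eta_i| > H.
    small_mass: l1 norm of the entries with |eta_i| <= H.
    E: the weak-sparsity index sum_i |eta_i|^nu (smallest admissible E).
    """
    s2 = sum(1 for e in eta if abs(e) > H)
    small_mass = sum(abs(e) for e in eta if abs(e) <= H)
    E = sum(abs(e) ** nu for e in eta)
    return s2, small_mass, E

eta_demo = [2.0, 0.1, -0.05, 0.0, 1.5]
H_demo, nu_demo = 0.5, 0.5
s2_demo, small_mass_demo, E_demo = threshold_summary(eta_demo, H_demo, nu_demo)
bound_on_s2 = E_demo * H_demo ** (-nu_demo)          # E * H^(-nu)
bound_on_small_mass = E_demo * H_demo ** (1 - nu_demo)  # E * H^(1-nu)
```

Lowering $H$ moves more fixed effects into $J_2$ while shrinking the mass left outside it, which is the trade-off behind the eventual choice of $H$.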
For the proof of Lemma 5 below, we shall use Orlicz norms as defined in van der Vaart and Wellner 
(1996): Let $\psi$ be a non-decreasing, convex function with $\psi(0) = 0$. Then, the Orlicz norm of a 
random variable $X$ is given by 

\[
\|X\|_{\psi} = \inf\big\{C > 0 : E\psi(|X|/C) \le 1\big\},
\]

where, as usual, $\inf\emptyset = \infty$. We will use Orlicz norms for $\psi(x) = \psi_b(x) = e^{x^b} - 1$ for various 
values of $b$. The following Lemma provides a lower bound on the probability of $A_N$. 


Lemma 5. Let $\lambda_N = \sqrt{4MNT(\log(p \vee N))^3}$ for some $M > 0$. By Assumptions 1 and 3 we 
have 

P(^iv) > 1 - , 

for positive constants A and B. 

Proof. Consider the event {||Z'e||oo > Xn/2} first. To this end, let Zj^i denote the jth entry 
of the Ith column of Z, i.e. the jth entry of ( 2 : 1 , 1 ,/, 21 , 2 ,• • •, 2:i,r,/, 2 : 2 , 1 ,/, • • •, zn,t,i)' ■ Similarly, 
we write Sj for the jih. entry of e. Now note that j “ LtJ^) is a bijection from 

{1,...,N'T} to {1,...,N'} X where [xj denotes the greatest integer strictly less than 

X and [x] the smallest integer greater than or equal to x G M. In case the fth column of Z 
corresponds one of the lags of the left hand side variable, assume for concreteness the /cth lag, 
define Xn = cr ..., y|-ii 1 < j < n) and Sn,i = YTj=i ^j,Pj = 

n—1 


i=i 

= S'n-1,/ + ?/|-^],j_LSjr-feP [s^^-^^n-i:^\T\Xn-l] ■ 

Using that , n — [^JT) is a unique pair (i, t) G {1,..., A^} x {1,..., T} we have that 

P [^r^l n— 1 ] — E[£j,j [T"n— 1 ] = E[ej,^|(T(yj,s, . . . , yi^i—L, ^i,si •••,£/, 1,1 ^ s < t 1)] 

where the last equality follows from the assumption of independence across $1 \le i \le N$ (Assumption 1). 
By Assumption 1 this conditional expectation equals zero as the $\varepsilon_{i,s}$ are linear 
functions of $y_{i,s},\dots,y_{i,s-L}$ and $x_{i,s}$. Thus, $S_{n,l}$ is a martingale with mean zero (the increments 
are martingale differences by the above argument). A similar argument applies when the $l$th 
column of $Z$ equals $(x_{1,1,k},\dots,x_{1,T,k},x_{2,1,k},\dots,x_{N,T,k})'$ for some $1 \le k \le p_x$ such that every row 
of $Z'\varepsilon$ is a zero mean martingale. 

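Next, note that by Assumption 3 for all $1 \le j \le NT$, $1 \le l \le p$ and $\epsilon > 0$, one has 

\[
P(|z_{j,l}\varepsilon_j| > \epsilon) \le P(|z_{j,l}| > \sqrt{\epsilon}) + P(|\varepsilon_j| > \sqrt{\epsilon}) \le Ke^{-C\epsilon}.
\]

It follows from Lemma 2.2.1 in van der Vaart and Wellner (1996) that $\|z_{j,l}\varepsilon_j\|_{\psi_1} \le (1+K)/C$. 
Then, by the definition of the Orlicz norm, $E\big[\exp\big(\tfrac{C}{1+K}|z_{j,l}\varepsilon_j|\big)\big] \le 2$. Now use Proposition 2 in 
Appendix B with $D = C/(1+K)$, $\alpha = 1/3$ and $C_1 = 2$ to conclude 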

, V A NT . X 

P [wz'eWoo > ^) < < Ap^-^^"’\ 

1=1 ^j=i ^ 

Note also that the upper bound of the preceding probability becomes arbitrarily small for 
sufficiently large $N$ and $M$ such that we also conclude 

\[
\|Z'\varepsilon\|_\infty = O_p(\lambda_N). \tag{7.12}
\]


®For t = 1, the last expression in the above display is to be read as absence of conditioning on the error terms. 
Next, consider the event $\{\|D'\varepsilon\|_\infty > \lambda_N/(2\sqrt{N})\}$. Using Assumption 1 a small calculation 
shows that all entries of $D'\varepsilon$ are zero mean martingales with respect to the natural filtration. 
As above, Assumption 3 and Lemma 2.2.1 in van der Vaart and Wellner (1996) yield 
$\|\varepsilon_{i,t}\|_{\psi_2} \le \big(\tfrac{1+K/2}{C}\big)^{1/2}$ such that by the second to last inequality on page 95 in van der Vaart and Wellner 
(1996) one has $\|\varepsilon_{i,t}\|_{\psi_1} \le \|\varepsilon_{i,t}\|_{\psi_2}(\log 2)^{-1/2} \le \big(\tfrac{1+K/2}{C}\big)^{1/2}(\log 2)^{-1/2}$ for all $i$ and $t$. Then using 
the definition of the Orlicz norm, $E\big[\exp\big(\big(\tfrac{C}{1+K/2}\big)^{1/2}(\log 2)^{1/2}|\varepsilon_{i,t}|\big)\big] \le 2$ and Proposition 2 in 
Appendix B with $D = \big(\tfrac{C}{1+K/2}\big)^{1/2}(\log 2)^{1/2}$, $\alpha = 1/3$ and $C_1 = 2$ implies 


PfllD'el 


> 


Aw 

2 //V 


N T 


i=l 


t=l 


> 


\n 


2y/NT 


T < 




Note also that the upper bound of the preceding probability becomes arbitrarily small for 
sufficiently large $N$ and $M$, such that we may also conclude 

\[
\|D'\varepsilon\|_\infty = O_p\Big(\frac{\lambda_N}{\sqrt{N}}\Big). \tag{7.13}
\]

□ 
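
The Orlicz norms $\|X\|_{\psi_b}$ with $\psi_b(x) = e^{x^b} - 1$ that appear throughout the proof above are defined through an infimum, which can be approximated for an empirical distribution by bisection. The sketch below is illustrative only: it replaces the expectation by a sample average, and the function names and tolerance are assumptions.

```python
import math

def psi_mean(sample, c, b):
    """Sample average of psi_b(|x| / c) = exp((|x| / c)^b) - 1."""
    total = 0.0
    for x in sample:
        try:
            total += math.exp((abs(x) / c) ** b) - 1.0
        except OverflowError:
            return math.inf          # c is far too small for this sample
    return total / len(sample)

def empirical_orlicz_norm(sample, b, tol=1e-10):
    """Smallest c > 0 with average psi_b(|x| / c) <= 1, found by bisection."""
    if all(x == 0 for x in sample):
        return 0.0
    hi = max(abs(x) for x in sample)
    while psi_mean(sample, hi, b) > 1.0:   # enlarge until the constraint holds
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if psi_mean(sample, mid, b) <= 1.0:
            hi = mid                       # constraint holds, try a smaller c
        else:
            lo = mid
    return hi

# For a variable that is constant and equal to 1, the psi_2 norm is 1/sqrt(log 2).
norm_psi2 = empirical_orlicz_norm([1.0, 1.0], 2)
# For a variable that is constant and equal to 2, the psi_1 norm is 2/log 2.
norm_psi1 = empirical_orlicz_norm([2.0], 1)
```

The bisection relies on the sample average being decreasing in `c`, which holds because $\psi_b$ is non-decreasing.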


The following lemma shows that $\kappa^2(\Psi_N, s_1, s_2)$ and $\kappa^2(\Psi, s_1, s_2)$ are close if $\Psi_N$ and $\Psi$ are 
in some sense close. 

Lemma 6. Let $A$ and $B$ be two positive semidefinite $(p+N) \times (p+N)$ matrices and 
$\delta := \max_{1\le i,j\le p+N}|A_{ij} - B_{ij}|$. For any integers $r_1 \in \{1,\dots,p\}$ and $r_2 \in \{1,\dots,N\}$, one has 

\[
\kappa^2(B, r_1, r_2) \ge \kappa^2(A, r_1, r_2) - 25\delta(r_1+r_2).
\]

Proof. Let $x$ be a $(p+N) \times 1$ non-zero vector, satisfying $\|x_{R^c}\|_1 \le 4\|x_R\|_1$ for $R = R_1 \cup (R_2+p)$ 
where $R_1 \subseteq \{1,\dots,p\}$ with $|R_1| \le r_1$, and $R_2 \subseteq \{1,\dots,N\}$ with $|R_2| \le r_2$. Now, 

\[
|x'Ax - x'Bx| = |x'(A-B)x| \le \|x\|_1\|(A-B)x\|_\infty \le \delta\|x\|_1^2 = \delta\big(\|x_R\|_1 + \|x_{R^c}\|_1\big)^2 \le \delta\big(\|x_R\|_1 + 4\|x_R\|_1\big)^2 = 25\delta\|x_R\|_1^2.
\]

Hence, 

\[
\frac{x'Bx}{\frac{1}{r_1+r_2}\|x_R\|_1^2} \ge \frac{x'Ax}{\frac{1}{r_1+r_2}\|x_R\|_1^2} - 25\delta(r_1+r_2) \ge \kappa^2(A, r_1, r_2) - 25\delta(r_1+r_2),
\]

where the last inequality is true because of the definition of $\kappa^2(A, r_1, r_2)$. Minimising the left-hand 
side over non-zero $x$ satisfying $\|x_{R^c}\|_1 \le 4\|x_R\|_1$ yields the claim. □ 
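
The key inequality in the proof, $|x'Ax - x'Bx| \le \delta\|x\|_1^2$ with $\delta$ the largest entrywise difference between $A$ and $B$, holds for arbitrary square matrices and is easy to check numerically. The sketch below uses small hand-picked matrices; the helper names are hypothetical.

```python
def quad_form(M, x):
    """Quadratic form x' M x for a square matrix stored as a list of rows."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))

def max_entry_gap(A, B):
    """delta = max_{i,j} |A_ij - B_ij|."""
    return max(abs(a - b) for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

def check_bound(A, B, x):
    """Return (|x'Ax - x'Bx|, delta * ||x||_1^2)."""
    delta = max_entry_gap(A, B)
    l1_norm = sum(abs(v) for v in x)
    return abs(quad_form(A, x) - quad_form(B, x)), delta * l1_norm ** 2

A_demo = [[1.0, 0.0], [0.0, 1.0]]
B_demo = [[1.0, 0.1], [0.1, 1.0]]
gap_demo, bound_demo = check_bound(A_demo, B_demo, [1.0, -2.0])
```

In the example the quadratic forms differ by 0.4 while the bound equals 0.9, so the inequality holds with some slack.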


Define 


Bn = < max 




N,ij ^ij \ — 


nK^z,si) 


50 


Si -I- U 


Fnt 


Setting $A = \Psi$, $B = \Psi_N$ it follows from Lemma 6 that $\tilde{B}_N \subseteq B_N$ as $\kappa^2(\Psi_Z, s_1) \le \kappa^2(\Psi, s_1, s_2)$ 
for all $s_2 \in \{1,\dots,N\}$ as argued prior to Assumption 2. Thus, we just need to find a lower 
bound on $P(\tilde{B}_N)$ in order to prove Theorem 1. 
Lemma 7. Let Assumptions and 0 hold. Assume that si + E ^ V^- Then, 

there exist positive constants A, B such that 


P(i3^) < P(i3^) < + pN) exp {-B\n/ 


\ -i' 2. i/3\ 

] } ) ■ 

Proof. Since the lower right N x N blocks of 'kw and 'k are identical, it suffices to bound the 
entries of j^Z'Z — ■j^E[Z'Z] and jA—Z'D. A typical element of j^Z'Z — ^E[Z'Z] is of 

the form J2j=i{zi,t,lZi,t,k - Hzi^t,lZi,t,k]) for some l,k £ {1,... ,p}. By Assumption [3] 

we have for every e > 0 

^{\zi,t,izi,t,k\ > e) < ^{\zi,t,i\ > V^) + F{\zi^t,k\ > \/e) < Ke~^\ 

It follows from Lemma 2.2.1 in van der Vaart and Wellner ( 1996li that \\zi^t,iZi,t,k\\ii)i < (1 + 
K)IC. Hence, by subadditivity of the Orlicz norm and Jensen’s inequality 

T 




t=l 


bi 


„ „ 2 ( 1 + A') 

< 2 max \\zi^t,iZi,t,k\\i,i < - 7 ;-■ 

l<t<T ’ ’ ’ ' "'1'^ C 


Thus, by the definition of the Orlicz norm, Eexp ( 2 ( 1 +^-) |T “ ^[zi,t,iZi,t,k])\) < 

2. Using independence across i (Assumption [T|) to invoke Proposition [2] in Appendix B with 
D = 2 (i+k) ’ « = 1/3 and Ci = 2 such that for every x > 

N T 

^ (]E T J2^^ht,iZi,t,k - nzi,t,iZi,t,k])\ > Nx) < (7.14) 

i=l ^ t=l 

for positive constants A and B. 

Next, consider jA—Z'D. A typical element can be written as Ylt=i Zi,t,i for some 

i G {1,... ,N} and I £ {1,... ,p}. By Ass umption [3l we have f‘{\zi±j\ > e) < for all 

e > 0 and it follows from Lemma 2.2.1 in van der Vaart and Wellnen ( 1996l l that ||2i,i,i|lb2 — 


^ . Hence, 




b 2 


1 „ „ 1 /1 + A / 2^^/2 

< max Uitz U, < 


c 


c 


Thus, it follows by Markov's inequality, positivity and increasingness of $\psi_2(x)$, as well as 
$1 \wedge \psi_2(x)^{-1} = 1 \wedge (e^{x^2} - 1)^{-1} \le 2e^{-x^2}$, that for any $x > 0$ 


t=i 


> X I < 1 A 


1 


^{xVnIC'P _ I 


< 2 e-t^ < 


(7.15) 


where the last estimate follows by choosing A and B sufficiently large/small for (I7.14p and (17.151) 


both to be valid. Setting x = 


50 


Sl+E 


^/nt 


50 


si-\-E 


£nt 


—, using that 


> 




and being bounded away from 0 (Assumption [5]) , we have 


P(H^) < P(H^) = P max N,ij - ^ij\ > x 

\l<ij<p+W 


< A(p^ +pN) exp 

< A{p'^ + pN) exp [ —B < N/ 


k750 


n 2 




-Si + E 
si + E 


Ajv 

Vnt 
Xn 


1/3, 


7 . 

-. 2 . 1/3n 


V exp —B 


kV50 


n 2 




N 


Vnt 
where the last estimate has merged $(\kappa^2/50)^{1/3}$ into $B$. 


□ 


Proof of Theorem 1. Theorem 1 follows by combining Lemmas 4, 5 and 7. 


□ 


Corollary 1. Let the conditions of Theorem{^ hold. For large enough M > 0 and assuming 

-L--, 2 


(log(pVAf))3 


si+E 


Vnt 


F{si,iy,E). 


N 


NT 


= o(l), we have the following stochastic orders valid uniformly over 


|d — a||i = Of 


^Vnnt 


— Op ( Si 


Xn 


+ Op E 


Xn 


1-1/ 


y/NTj V ^^/NT' 

Proof of Corollary m Given positive constants A and B, and become 


arbitrarily small for large enough M >0. By 

l/3\ 


(log(pVAf))3 


S1+-E 


Vnt 


N 


= o(l), /4(p2 + 


pN) exp —B < N/ 


Si + -E 


Cnt 


0 as —)• oo. Thus the lower bound on the 


probability in Theorem [T] goes to one as —?■ oo for large enough M > 0 and the conclusion 
follows from Theorem [H □ 

7.3 Proof of Lemma 1 

The following lemma gives the rates of the uniform prediction and estimation errors for nodewise 
regression. 


Lemma 8. Let Assumptions 1, 3 and 4 hold. Let $\lambda_{node} = \sqrt{16M(\log p)^3/N}$ for some $M > 0$. 
For $M$ sufficiently large, we have 


max ■ 


max Wij -<pj\\i=Op (OA'll,) 


jeHi 

1 


max 


j&Hi NT 


l-^—iCl'Iloo — Opi^Xfiode) ■ 


(7.16) 

(7.17) 

(7.18) 


Proof. We say that a (p — 1) x (p — 1) matrix A satisfies the compatibility condition CC{r) for 
some integer r E {1, ... ,p — 1} if 


{A, r) := min 


5'A5 


mm 


ijc{r“.7p-i} 5eiR7-f ({0} 111 11 f 


> 0 . 


Define a (p — 1) x 1 vector (ft such that 

4^j,k ■ — -^node} k 1, . . . ,p 1 
and its active set J* as well as its sparsity index s* 

j; = 1 < s* := |j;| < p - 1. 

Consider the events 

'n f W ry! /- W ^ ^node I 




Kj 




and 


Fn = 


1 

1 


NT 


Z'Z -'^z 


— ^node f • 


Using the same technique as in Section 6.2.3 of Bühlmann and van de Geer (2011), we arrive 
at the following oracle inequality, which is almost the same as the one on the top of p. 111 of 
Bühlmann and van de Geer (2011): for each $j \in H_1$, on $\mathcal{D}_N \cap \mathcal{E}_{N,j}$ 


jyj^\\Z-j{^j 4'j)\\ F Xnode\\$j 4'j\\l ^ j,^j,\\Z-j{(pj 4>j)\\ + 


^ 8 XLdeS* 


“1“ '^node||'?^j 4^j 


< 


-112’-, 

A^r" ^ 




pj - (PjU 
(7.19) 


where the second inequality is due to event Enj and that z-j-jH) > ('h^,r) for all 

j = l,...,p and r = l,...,p- 1. 

We now bound the three terms on the right hand side of (|7.19D . Let bj := (p’j — (pj- 

< maxeval(^'z,-i,-i)||^j|P + ■^Z'_jZ_j - ^z-j-j \\bj\\l < maxeval(^'z)||6j|p + Xnode\\bj\\l 
where the last inequality holds on event Hn. Note that 


p—1 P—^ 

ll^ill = < Xnode} < -^node II < Xnode} < ^jX^ode' 

h — 1 t*— 1 


P-1 


p—1 P~1 

ll^illl = '^\^j,k\H\(pj,k\ < Xnode} < Xl~ae'^\^j,kfH\^j,k\ < Xnode} < GjX^ 
k=l k=l 


-'d 
node' 


p—1 p—1 

1<S*=Y. im,,k\ > Xnode} = Y. > Ziode} < GjX 

k=l k=l 


-1? 

node' 


Thus, for each j £ Hi, on Vf^ n Enj G Fn 

1 


NT' 


Z—j{(pj */’j')ll T AfiocJe 11 ((’j 


< + OjAf,;2’ + 


-,2\3-2d 


96 




/O \2—I? I ^ \2—1? 

*\^j\ode -r 


^3 ^^node 


= ( maxeval(T^) + 


96 




4-1 1 r’ ■ 4- r'^ 

k‘^{^Z, S*) ^ Fiode “T '^j^node 


(7.20) 

(7.21) 
from where we can extract two oracle inequalities 

1 


96 


< (^maxeval(^'z) + 


+ i I Lrj Anode + '-"j "'node ’ 


— ^j\\i < ( maxeval('hz) + 


96 




I 1 1 ^ . \1—I /^2 \2—2i? 

+ -*- I "'node + \ode ' 


As the oracle inequalities in the above display are valid simultaneously on 'D]\fr\{r\j£Hi£N,j)(~^J^N 
we conclude that 


1 


max ■ 


\Z-j{$j — 4>j)\\‘^ < ( maxeval('I' 2 ) + 


96 




+ 1 I GXti + G"A: 


-2x3-27? 
node ’ 


max 

j&Hi 


— < ( maxeval('I'z) H-: 


96 




^ ^ ^ ^ (7.22) 


=(2 \2-2d 


on Ptv n (rij-gHj^Arj) n Fn- 

Next, we establish a lower bound on the probability of Vn H {f\j^H^_£N,j) H Fn- Consider 


Vn first. A typical element of Z'_-C,j is of the form X]i=i Xi=i some I ^ j. By 

P-12P, one has ^ Xlili TI=i - ^Zi^t,lCj,i,t\) for I / j. By 

Assumptions [3] and |lKc) , it holds for any e > 0 that 

^{\zi,t,iCj,i,t\ > e) < ^{\zi,t,i\ > Ve) +P(ICi,i,t| > Ve) < Ke~^\ 

such that Lemma 2.2.1 in van der Vaart and Wellner ( 1996l l yields that \\zi^t,lCj,i,t\\i’i ^ (1 + 
K)IC. Therefore, by Jensen’s inequality and subadditivity of the Orlicz norm 




t=i 


„ ^ „ 2(1 +AT) 

< 2imax ^ -■ 


ipi 


Using the definition of the Orlicz norm Eexp ( 2 ( 1 +^) 1 + XLi {zi,t,iCj,i,t - E[+,t,;Cj,^,t])|) < 2. 
Using independence across i (Assumption [1]) to invoke Proposition [2] in Appendix B with D = 
(7/(1 + K), a = 1/3, Cl = 2 and e = Xnode!^ > we conclude (using hi < p) 


N 


f( 


max • 


1^1 


V 'edT ^ — hipF ^1^^ J, ''y^{zi,t,iC,j,i,t ^[zi,t,iC,j,i,i\) > 

^ ^ i=i t=i 


for positive constants A and B. The upper bound of the preceding probability becomes arbi¬ 
trarily small for M sufficiently large such that 

T 7 +11 7 Cj 11 00 Op{\node)i 

j&Hi ly 1 

which is (j7.18p . In order to provide a lower bound on the probability of {f^j^Hl£N,j) define the 
event 


£N,j 


< max 

},^Z' F 


[iVT ^ 


Ik 


- [^Z-j-j\lk 
by Proposition 1 in Appendix B with $A = \Psi_{Z,-j,-j}$, $B =$ 


r = s* and S = 


32s* 


— Observe that the relation 
1 


max 

l</,/c<p—1 




J Ik 




nik 


< max 
l<Z,fc<p 


NT 


-Z'Z 


- [^z] 


Ik 


Ik 


< 


K2(^'z,maXjgHi s*) K^('Ifz-j-j,s*) 


32GX 


-1? 

node 


< 


32s* 


implies £n ■= |maxi<;^fc<p| z]ik\ < | ^ £M,j C fjvj for all j £ 

Hi and hence £n C rij-gHi^ATj- It remains to provide a lower bound on P(£lAr). A typical element 
of -^Z'Z-^z is of the form Ya=i Y.^=i{zi,tHiTk-^.Zi,tHi,t,k]) for some Z,/c G {!,... ,p}. 
Invoking ( | 7.14 | ) with x = ^ ~ (using = 0{\og^/'^p), implied by 

Assumption HKb)) 


N T 


( ^ J] ^ - ^[Zi,t,lZiTk]) >X^<Ae 


i=l t=l 

for positive constants A and B. Therefore 

1 


P(£l^) = P max 
\l<l,k<p 


NT 


Z'Z 


-[T 


Zjife 


Ik 




The upper bound of the preceding probability becomes arbitrarily small for M sufficiently large 
(using = 0(1), implied by Assumption SKb)). In a similar manner, invoke (I7.14p with 

X = ^ (M > 0), 


P(J^^) = P f max 

\l<l,k<p 


NT 


-Z'Z 


J Ik 


- I'l’zlttl > x) < Ve-"'*’"’''’ = 


for positive constants A and B, letting B absorb the extra constants. The upper bound of the 
preceding probability becomes arbitrarily small for sufficiently large N and M. We also have 


Z'Z 


NT 


- ^z 


= Op(A„ode) = Op 


Lastly, use Assumption HKb) in the display (l7.22l) to get the claimed orders. 


(7.23) 

□ 


Proof of LemmaUl Recall (|3.6p and use Zj = Z^jcpj + Q: 




Cv'Ci + js^rpCjZ-j4>j ^-j^j NT^^^ 


Thus, 


I -'2 2 

max Tx — Tx 
jeHi ^ ^ 


—C'C-T^ 


+ max 

j&Hi 




- M'Z'Nj 


NT 


4 - max 


■J'rj 




i>^)'Z'_^Z.,f>^ 


(7.24) 


Consider the first term on the right of the inequality in (7.24). By Assumption 4(c), we have 
for all $\epsilon > 0$, $P(|\zeta_{j,i,t}^2| > \epsilon) = P(|\zeta_{j,i,t}| > \sqrt{\epsilon}) \le \tfrac{K}{2}e^{-C\epsilon}$. It follows from Lemma 2.2.1 in 
van der Vaart and Wellner (1996) that $\|\zeta_{j,i,t}^2\|_{\psi_1} \le (1+K/2)/C$. Therefore, by Jensen's 
inequality and subadditivity of the Orlicz norm 

T 


1 


t=l 




s 2 max llcii 


< 


2 + K 




Ipl 


Using the dehnition of the Orlicz norm, Eexp Ylt=i iCjn ~ ®[Cji 4 ])|) — 2- Using in¬ 

dependence across i = 1,... ,N (Assumption [1]) to invoke Proposition [2] in Appendix B with 
D = 0/(2 -|- K), a = 1/3 and Oi = 2 for x > 


N T 

*■ (I V S / E (dm - E|C|„1)| > x) < 

i=l t=l 

for positive constants A and B. Setting x = gome M > 0, we have 


P 




T 

i=l t=l 


N T 

1 A 1 


sEf vErE(0<-E|cUi) 

j£Hi ^ i=l t=l 


> 


> 


M(log hi) 3 
N 




N 


Recognising that the upper bound of the preceding probability becomes arbitrarily small for 
sufficiently large N and M, we have 


max 

j&Hi 


1 


_ ■ — T-? 


= 0 , 


(log hif 

N 


Op{Xnode) • 


Now consider the second term on the right of the inequality in (|7.24l) . Recall that 

^ 1 —(Til o • • • — {h^ ^ ^ 


c = 


V 


1 

-4>1,2 ■ ■ 

4^1,p 

4>2,1 

1 


4‘p,i 

-4>p,2 ■ ■ 

1 


/ 


such that Cj is the jth row of C but written as a p x 1 vector. Then 


max||(/)j||i = max||(/)- 

j&Hi 


1 ^ 


< rnax ||0* - + rnax \\cj)*\\i < GX^J^ + max \\(j)*\\i 


j&Hi 


j&Hi 


j&Hr 


< <^X^noIe + max ^s*U*\\< GXl^J^ + max ^ s*\\cj)j\\ < GX^J^ + max ^ s* 11 Cj \ 


< GXlS +max Js 


C'^zC, 


/O \ 1 — 

jeHi V O y mineval('I'2) je%i 


^z,j,j - ^Z,j-j^z^-j -j^z-j, 


h-] 


Y^mineval('k2) 


< GXl^l + rnax 




iei^i y^minevayT^) 


7 7 a s*\ maxevahTz) 

< 0\'J, + max ^ . - = o(a^r‘\-^) 


JeHi y^minevayT^) 


(7.25) 


where the second inequality is due to (|7.20p . the second equality is due to (13.IR . the seventh 
inequality is due to that Assumption UJ) a) implies that -j 1® positive definite for all j G Hi, 

and the last equality is due to (17.2111 and Assumption HKb)). Now, 


max 

jeHi 


1 


C7^-7 


jyjaSj JYJ 


< max 

j&Hi 


1 




Ik.lli) = OpiX^ode)0{G^Gx-^G) = Op{G^Gxl-J/^), 
where the first equality is due to (7.18). 
The third term in (7.24) is bounded as 

1 


max 

j&Hi 


-XcP, - c^X'Z'-jC, 


< max 
jeHi 


1 




= OpiOXll), 


where the equality is due to (I7.17P and (|7.18l) . 

To bound the fourth term on the right of the inequality in (I7.24p , recall p3.8D and manipulate 
to get j^Z_jZ—j[Xj Xj) ~ j^Z_j(^j XnodeWj- Thus, 

1 




where the equality is due to p7.18p . Thus, 



1 , 

< 


OD 


T -^node 11 11 oo — Op(Xnode)j 


max 


NT 


(Xj 4>j) z_jZ-j<pj 


< max 


NT 




rnax||(/)j||i = Op(G'^/^A^ J/^), 

where the last equality is due to (17.251) . Summing up all four terms on the right of the inequality 
in p7.24p . we get 

max \ff - t| 1 < Op (A„p*) + + Op(GA=;?J = 0,(5'/^;^^) = Op(l), 

where the first equality is due to that Op{G ^^'^dominates Op{GX‘^^^) by Assumption 
Sib), and the second equality is also due to Assumption |Hb). This establishes (I3.14p . 

We now prove p3.15p . We first recall 

a N T ^ ^ 


t/ = E 


i=l t=l 




Furthermore, 




3 


e'^Ozei 


|2 — 


< max 


S'QzS 


< 5 eRp\{o} 


= maxeval(0z) = 


mineval(T^) 


The preceding inequality is uniform in $j$. Thus, $\min_{j\in H_1}\tau_j^2 \ge \mathrm{mineval}(\Psi_Z)$, which is uniformly 
bounded away from zero by Assumption 4(a). Therefore, 

min ff = mm(ff — rf + rf) > min rf — max — t?| > minevalfTz) — Oofl). 
j&H^ J j&HT ^ ^ ^ j&Hi ^ jeHi' J J' y j P\ j 


Hence, we conclude that min^g^f^ r? is bounded away from zero for N large enough and 


maxjgj^^ = Op{l) which establishes (13.151) . 
b 

Hence, 

1 


max 

jeffi 


1 


f2 r2 
3 3 


maxjgj/jr^ - f2| i 

< - -— „ ■ max —^ = max 

minjgHi rj je//i rj j'eHi 


r| - P-\OmO,{l) = Op(G‘''U;j'f), 


which establishes p3.16p . 

We can now bound $\|\hat\Theta_{Z,j} - \Theta_{Z,j}\|_1$. Use the definition of $C_j$ and (3.10) to recognise 
that $\Theta_{Z,j} = C_j\Theta_{Z,j,j} = C_j/\tau_j^2$. 
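
The relation $\Theta_{Z,j} = C_j/\tau_j^2$, where $C_j$ has a one in position $j$ and the negative nodewise regression coefficients elsewhere, can be illustrated with population quantities in a two-variable example. The sketch below is hypothetical in its naming and input layout; for $\Psi_Z$ with unit variances and correlation $\rho$, the nodewise coefficient is $\rho$ and the residual variance is $1-\rho^2$, and the rows $C_j/\tau_j^2$ reproduce the inverse of $\Psi_Z$.

```python
def nodewise_precision(phi, tau2):
    """Stack the rows C_j / tau2[j] into a p x p matrix.

    phi[j]: the p-1 coefficients from regressing variable j on the others,
            listed in increasing index order with index j skipped (assumed layout).
    tau2[j]: the corresponding residual variance.
    """
    p = len(tau2)
    theta = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        row = [0.0] * p
        row[j] = 1.0                       # diagonal entry of C
        for coef, k in zip(phi[j], others):
            row[k] = -coef                 # off-diagonal entries are -phi_{j,k}
        theta.append([value / tau2[j] for value in row])
    return theta

rho = 0.5
psi_demo = [[1.0, rho], [rho, 1.0]]
theta_demo = nodewise_precision([[rho], [rho]], [1 - rho ** 2, 1 - rho ** 2])
# Multiply psi_demo by theta_demo; the result should be the identity matrix.
product_demo = [[sum(psi_demo[i][k] * theta_demo[k][j] for k in range(2))
                 for j in range(2)] for i in range(2)]
```

With estimated coefficients and variances in place of the population ones, the same construction yields the rows of the estimated inverse.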


max 




Cj 

Cj 


1 

1 



^7 

Qz,j - &Z,j 

= max 

1 

n 

rf 

= max 

, 1 


+ max 
i6J?i 

' J 

-1 

' J 

i 


< max -o + rnax ^ ^ + max 

jeHi ff Tf j&Hi Tf Tf , jeHi 


^3 


1 


= max —^-^ + max —^ 6j — 

jem Tf Tf jeHi Tf ^ 


+ max \\(Pi\\i 

1 jeHG' 


f2 r2 
3 3 


= Op{GX 


l-d A 
node /’ 
which establishes (3.17). Next, we bound $\max_{j\in H_1}\|\hat\Theta_{Z,j} - \Theta_{Z,j}\|$. Since 
Z'Z 


max 

i6Hi 


(C, - c,y^{c, - C,) - {Cj - c.y^ziCj - c,) 

{Cj-C,y^z{c,-Cj) 



Z'Z 


< 

- z 

NT 

00 


< max 

j&Hi 


(Cj - Cjy^(Cj - Cj) 


+ 


Z'Z 


Consider the first term on the right hand side of ()7.26p . 


max 


(Cj Cj) (Cj Cj) 


1 


= max ■ 


j&Hi NT 


Z(Cj-C,) 


NT 


1 


— ^ z 


Cj - C, 


max 


Cj - Cj 
(7.26) 


= max ■ 


j&Hi NT 


Z-j(c^j-<^j) =0,(GXl-i), 


where the last equality is due to (I7.16D . Next, consider the second term on the right of the 
inequality (I7.26D . We have 


Z'Z 


NT 


— ^z 


max 


Cj - Cj 


Z'Z 


NT 


— '^z 


max 


= 0„ G^X 


--i2 \3-2i? 


'node ) ’ 


where the first equality is due to the definitions of Cj and Cj, and the second equality is due 
to (|7.17p and (I7.23p . Adding up the two terms, we have 


max 

jeHi 


(Cj - Cj)'^z(Cj - Cj) < Oj,(GXl-J) + O, G^X 


y2\S-2'& 

^node 


= Op(Gxl-i), 


where the last equality is due to Assumption HKb). Since maxj^Hi \ (Gj — Cj)''^z(Gj — Cj)\ > 
mineval('I'^) maxjgj:/^ \\Gj — Gj\\^ and mineval('I'^) is uniformly bounded away from zero we 
have maXjgHi \\$j - (t>j\\ = max^gj/^ \\Cj - Cj\\ = Then, 


max 


0Z,i - 0z,i 


= max 


ff rj 


< max 

j&Hi 


1 1 


(j)j (j)j 

?2 " 72 

+ max 




1 

1 

1 



1 

1 

< max 

/s 9 

9 

-|- max —TT 

(fj — (fj 

+ max d)J 

9 


iehfi 



NHi rf 

jeHi 

rf 

rf 




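where in the last equality we have used that $\max_{j\in H_1}\|\phi_j\| = O(1)$, which follows from inspecting the arguments in (7.25). We have hence established (3.18). Finally, recall that $\Theta_{Z,j} = C_j\Theta_{Z,j,j} = C_j/\tau_j^2$, so that

$$\max_{j\in H_1}\|\Theta_{Z,j}\|_1 = \max_{j\in H_1}\|C_j\|_1/\tau_j^2 = O\big(G^{1/2}\lambda_{\mathrm{node}}^{-\vartheta/2}\big). \tag{7.27}$$

Therefore,

$$\max_{j\in H_1}\|\hat\Theta_{Z,j}\|_1 \le \max_{j\in H_1}\|\hat\Theta_{Z,j}-\Theta_{Z,j}\|_1 + \max_{j\in H_1}\|\Theta_{Z,j}\|_1 = O_p\big(G\lambda_{\mathrm{node}}^{1-\vartheta}\big) + O\big(G^{1/2}\lambda_{\mathrm{node}}^{-\vartheta/2}\big) = O_p\big(G^{1/2}\lambda_{\mathrm{node}}^{-\vartheta/2}\big),$$

where the last equality is due to Assumption 4(b). □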
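The step from a sample second-moment matrix to its population counterpart rests on the elementary bound $|v'Av - v'Bv| \le \max_{i,j}|A_{ij}-B_{ij}|\,\|v\|_1^2$. The following is a minimal numerical sketch of that inequality only; the function names and the random test matrices are illustrative assumptions and are not part of the paper.

```python
import random


def quad_form(mat, vec):
    """Return vec' * mat * vec for a square matrix stored as a list of rows."""
    n = len(vec)
    return sum(vec[i] * mat[i][j] * vec[j] for i in range(n) for j in range(n))


def sup_norm_quadratic_bound(dim, seed):
    """Draw A, B and v at random and return both sides of
    |v'Av - v'Bv| <= max_ij |A_ij - B_ij| * ||v||_1^2."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    # B is a small entrywise perturbation of A (think: sample vs population moments).
    b = [[a[i][j] + rng.uniform(-0.1, 0.1) for j in range(dim)] for i in range(dim)]
    v = [rng.uniform(-2.0, 2.0) for _ in range(dim)]
    lhs = abs(quad_form(a, v) - quad_form(b, v))
    max_diff = max(abs(a[i][j] - b[i][j]) for i in range(dim) for j in range(dim))
    rhs = max_diff * sum(abs(x) for x in v) ** 2
    return lhs, rhs


if __name__ == "__main__":
    for seed in range(5):
        lhs, rhs = sup_norm_quadratic_bound(6, seed)
        print(f"seed={seed}: lhs={lhs:.4f} <= rhs={rhs:.4f}")
```

The bound is what allows an entrywise (sup-norm) rate for a matrix difference to be combined with an $\ell_1$ rate for the vector it multiplies.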

7.4 Proof of Theorem 2


Proof of Theorem 2. The following assumption is implied by Assumption 5 (see the footnote below). However, as Assumption 5 is much simpler, we have chosen to use the latter in the main text even though it is slightly less general than the following assumption. Note again how the assumptions simplify when either $h_1$ or $h_2$ equals 0.


®To be precise, Assumption IHa) implies Assumption ETa) by recognising that hi > 1 if hi 7 ^ 0 , and 
G j > sj > 1 . Assumption[Sfb) implies Assumption[Sfb) by recognising that 

•\J implied by Assumption a) provided hi yf 0 and h 2 yf 0, respectively. Last, Assumption 

[5jc) implies Assumption | 6 pc). 
Assumption 6.

(a) (i) 

(ii) 

(in) 


(iv) 

(v 

(b) Let 



(logpp 

N 

(log(pVT))« 

— i ■ 

hi/iaG 

{logp)3 

N 

V 

1 -t?/2 

{\og{pVNVT)f 

hiG'^ 

(logpp 

N 

N ~ 

-i5 

(logp)3(log(pVAf))3 

hiG 

(logpp 

N 

N ~ 

-i>/2 

(log(iVVT))2(logp)2 


NT 


= o(i); 

= o(i); 

= o(l); 


(v) )MELlp{h2±^ = o{l). 


a := 


siM E 


(log(pVjV))3 

T 


-vji 


(log(p V A))" 


(i) h\G'^ 

(ii) hiG 


(logp)^ 


N 


(logp)^ 


-1? 


1 V 


N 


NT 

a = o(l); 


1 -A2 


N 


log(p V A V T)a = o(l); 


(iii) hih2G 


(logp)^ 


N 


- i ?/2 


(log(p V A V T))‘^a = o(l); 


(iv) \ hih 2 G 


(logp)^ 


3 1-A2 


N 


Alog(p V A V T)a = o(l); 


(v) Nhj (^1 V a = o(l). 


(c) 


(/ii V/i 2 )(log(pV A))3 Isjv /(log(^M 


3\ -I"' 


A 


= o(l), 


where b := 


G 


(logp)^ 


31-A2. 


N 


log(p V A) V (logp)'*) l{/ii / 0} V log(p V A)l{/i 2 / 0} 


(d) $\mathrm{mineval}(\Sigma_{\Pi\varepsilon})$ is uniformly bounded away from zero and $\mathrm{maxeval}(\Sigma_{1,N})$ is uniformly bounded from above.


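We show that

$$t = \frac{\rho' S(\tilde\gamma-\gamma)}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \xrightarrow{d} N(0,1).$$

To this end, note that by (3.3) one may write $t = t_1 + t_2$, where

$$t_1 = \frac{\rho'\hat\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \quad\text{and}\quad t_2 = -\frac{\rho'\Delta}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}}.$$

Defining

$$t_1' = \frac{\rho'\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho}},$$

it suffices to show that $t_1' \xrightarrow{d} N(0,1)$, $t_1 - t_1' = o_p(1)$, and $t_2 = o_p(1)$. In the sequel we first show that $t_1 - t_1' = o_p(1)$, then $t_1' \xrightarrow{d} N(0,1)$ and finally $t_2 = o_p(1)$. To show that $t_1 - t_1' = o_p(1)$, it suffices to show that the denominators as well as the numerators of $t_1$ and $t_1'$ are asymptotically equivalent since

$$\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho \ge \mathrm{mineval}(\Sigma_{\Pi\varepsilon})\,(\mathrm{mineval}(\Theta))^2 = \frac{\mathrm{mineval}(\Sigma_{\Pi\varepsilon})}{(\mathrm{maxeval}(\Psi))^2}, \tag{7.28}$$

which is uniformly bounded away from zero by Assumptions 4(a) and 6(d).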


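7.4.1 Denominators of $t_1$ and $t_1'$

We first show that the denominators of $t_1$ and $t_1'$ are asymptotically equivalent, i.e.,

$$\big|\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho - \rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho\big| = o_p(1). \tag{7.29}$$

Write

\begin{align*}
&\left|(\rho_1',\rho_2')\begin{pmatrix}\hat\Theta_Z\hat\Sigma_{1,N}\hat\Theta_Z' & \hat\Theta_Z\hat\Sigma_{2,N}\\ \hat\Sigma_{2,N}'\hat\Theta_Z' & \hat\Sigma_{3,N}\end{pmatrix}\begin{pmatrix}\rho_1\\ \rho_2\end{pmatrix} - (\rho_1',\rho_2')\begin{pmatrix}\Theta_Z\Sigma_{1,N}\Theta_Z' & \Theta_Z\Sigma_{2,N}\\ \Sigma_{2,N}'\Theta_Z' & \Sigma_{3,N}\end{pmatrix}\begin{pmatrix}\rho_1\\ \rho_2\end{pmatrix}\right| \\
&\quad\le \big|\rho_1'\hat\Theta_Z\hat\Sigma_{1,N}\hat\Theta_Z'\rho_1 - \rho_1'\Theta_Z\Sigma_{1,N}\Theta_Z'\rho_1\big| \tag{7.30}\\
&\qquad + 2\big|\rho_1'\hat\Theta_Z\hat\Sigma_{2,N}\rho_2 - \rho_1'\Theta_Z\Sigma_{2,N}\rho_2\big| \tag{7.31}\\
&\qquad + \big|\rho_2'\hat\Sigma_{3,N}\rho_2 - \rho_2'\Sigma_{3,N}\rho_2\big|. \tag{7.32}
\end{align*}

To establish (7.29), we show that (7.30), (7.31) and (7.32) are $o_p(1)$, respectively.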

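(7.30) is $o_p(1)$:

Define $\tilde\Sigma_{1,N} := \frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}z_{i,t}'\varepsilon_{i,t}^2$. To show that (7.30) is $o_p(1)$, it suffices to show that

\begin{align*}
\big|\rho_1'\hat\Theta_Z\hat\Sigma_{1,N}\hat\Theta_Z'\rho_1 - \rho_1'\hat\Theta_Z\tilde\Sigma_{1,N}\hat\Theta_Z'\rho_1\big| &= o_p(1) \tag{7.33}\\
\big|\rho_1'\hat\Theta_Z\tilde\Sigma_{1,N}\hat\Theta_Z'\rho_1 - \rho_1'\hat\Theta_Z\Sigma_{1,N}\hat\Theta_Z'\rho_1\big| &= o_p(1) \tag{7.34}\\
\big|\rho_1'\hat\Theta_Z\Sigma_{1,N}\hat\Theta_Z'\rho_1 - \rho_1'\Theta_Z\Sigma_{1,N}\Theta_Z'\rho_1\big| &= o_p(1). \tag{7.35}
\end{align*}

We prove (7.33) first. Note that

$$\big|\rho_1'\hat\Theta_Z\hat\Sigma_{1,N}\hat\Theta_Z'\rho_1 - \rho_1'\hat\Theta_Z\tilde\Sigma_{1,N}\hat\Theta_Z'\rho_1\big| \le \big\|\hat\Sigma_{1,N}-\tilde\Sigma_{1,N}\big\|_\infty \big\|\hat\Theta_Z'\rho_1\big\|_1^2.$$

First,

$$\big\|\hat\Theta_Z'\rho_1\big\|_1 = \Big\|\sum_{j\in H_1}\hat\Theta_{Z,j}\rho_{1j}\Big\|_1 \le \sum_{j\in H_1}|\rho_{1j}|\,\big\|\hat\Theta_{Z,j}\big\|_1 \le \sqrt{h_1}\max_{j\in H_1}\big\|\hat\Theta_{Z,j}\big\|_1 = O_p\Big(\sqrt{h_1G\lambda_{\mathrm{node}}^{-\vartheta}}\Big), \tag{7.36}$$

where the last equality is due to (3.19).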
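We now bound $\|\hat\Sigma_{1,N}-\tilde\Sigma_{1,N}\|_\infty$. Since $\hat\varepsilon_{i,t} = \varepsilon_{i,t} - z_{i,t}'(\hat\alpha-\alpha) - (\hat\eta_i-\eta_i) = \varepsilon_{i,t} - \pi_{i,t}'(\hat\gamma-\gamma)$, substituting for $\hat\varepsilon_{i,t}$ we have

\begin{align*}
\big\|\hat\Sigma_{1,N}-\tilde\Sigma_{1,N}\big\|_\infty
&= \Big\|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}z_{i,t}'\big(\hat\varepsilon_{i,t}^2-\varepsilon_{i,t}^2\big)\Big\|_\infty \\
&\le 2\Big\|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}z_{i,t}'\varepsilon_{i,t}\pi_{i,t}'(\hat\gamma-\gamma)\Big\|_\infty
+ \Big\|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}z_{i,t}'\big[\pi_{i,t}'(\hat\gamma-\gamma)\big]^2\Big\|_\infty. \tag{7.37}
\end{align*}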
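The substitution behind (7.37) can be written out explicitly. The following is a sketch of the algebra only. The shorthand $r_{i,t} := \pi_{i,t}'(\hat\gamma-\gamma)$ is introduced here for illustration and is not notation from the paper; $\hat\Sigma_{1,N}$ and $\tilde\Sigma_{1,N}$ denote the averages of $z_{i,t}z_{i,t}'\hat\varepsilon_{i,t}^2$ and $z_{i,t}z_{i,t}'\varepsilon_{i,t}^2$, and $\hat\varepsilon_{i,t} = \varepsilon_{i,t} - r_{i,t}$ is the residual.

```latex
\begin{align*}
\hat\varepsilon_{i,t}^2 - \varepsilon_{i,t}^2
  &= (\varepsilon_{i,t} - r_{i,t})^2 - \varepsilon_{i,t}^2
   = -2\,\varepsilon_{i,t}\,r_{i,t} + r_{i,t}^2, \\
\hat\Sigma_{1,N} - \tilde\Sigma_{1,N}
  &= -\frac{2}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}z_{i,t}'\,\varepsilon_{i,t}\,r_{i,t}
     + \frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}z_{i,t}'\,r_{i,t}^2 .
\end{align*}
```

Taking the entrywise maximum norm and applying the triangle inequality gives the two terms that are bounded separately in the sequel.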


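Consider the first term of (7.37). A typical element of $\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}z_{i,t}'\varepsilon_{i,t}\pi_{i,t}'(\hat\gamma-\gamma)$ is

\begin{align*}
\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t,l}z_{i,t,k}\varepsilon_{i,t}\pi_{i,t}'(\hat\gamma-\gamma)
&\le \Big(\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t,l}^2z_{i,t,k}^2\varepsilon_{i,t}^2\Big)^{1/2}\Big(\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\big[\pi_{i,t}'(\hat\gamma-\gamma)\big]^2\Big)^{1/2} \\
&= \Big(\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t,l}^2z_{i,t,k}^2\varepsilon_{i,t}^2\Big)^{1/2}\Big(\frac{1}{NT}\big\|\Pi(\hat\gamma-\gamma)\big\|^2\Big)^{1/2} \tag{7.38}
\end{align*}

for some $l,k\in\{1,\dots,p\}$, where the inequality is due to the Cauchy-Schwarz inequality. Use independence across $i$ (Assumption 1) and subgaussianity (Assumption 3) to invoke Proposition 3 in Appendix B, such that

$$\max_{1\le l\le p}\max_{1\le k\le p}\Big|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\big(z_{i,t,l}^2z_{i,t,k}^2\varepsilon_{i,t}^2 - E[z_{i,t,l}^2z_{i,t,k}^2\varepsilon_{i,t}^2]\big)\Big| = O_p\bigg(\sqrt{\frac{(\log(p^2T))^7}{N}}\bigg)$$

and

$$\max_{1\le l\le p}\max_{1\le k\le p}\max_{1\le i\le N}\max_{1\le t\le T} E\big[z_{i,t,l}^2z_{i,t,k}^2\varepsilon_{i,t}^2\big] \le A = O(1)$$

for some positive constant $A$. Then, by the triangle inequality,

$$\max_{1\le l\le p}\max_{1\le k\le p}\Big|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t,l}^2z_{i,t,k}^2\varepsilon_{i,t}^2\Big| = O_p\bigg(\sqrt{\frac{(\log(p\vee T))^7}{N}}\bigg) + O(1). \tag{7.39}$$

Combining (7.38) and (7.39), we have

$$\Big\|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}z_{i,t}'\varepsilon_{i,t}\pi_{i,t}'(\hat\gamma-\gamma)\Big\|_\infty = O_p\bigg(\Big(\frac{(\log(p\vee T))^{7/4}}{N^{1/4}}\vee 1\Big)\Big(\frac{1}{NT}\big\|\Pi(\hat\gamma-\gamma)\big\|^2\Big)^{1/2}\bigg). \tag{7.40}$$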


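We now consider the second term of (7.37). A typical element of $\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}z_{i,t}'[\pi_{i,t}'(\hat\gamma-\gamma)]^2$ is

$$\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t,l}z_{i,t,k}\big[\pi_{i,t}'(\hat\gamma-\gamma)\big]^2 \le \max_{1\le i\le N}\max_{1\le t\le T}|z_{i,t,l}z_{i,t,k}|\,\frac{1}{NT}\big\|\Pi(\hat\gamma-\gamma)\big\|^2$$

for some $l,k\in\{1,\dots,p\}$. Recall that we have proved in the proof of Lemma 7 that $\|z_{i,t,l}z_{i,t,k}\|_{\psi_1} \le (1+K)/C$. Using the definition of the Orlicz norm, we have $E\exp\big(\tfrac{C}{1+K}|z_{i,t,l}z_{i,t,k}|\big) \le 2$. Using Markov's inequality, we have for any $e>0$

$$P\Big(\max_{1\le l\le p}\max_{1\le k\le p}\max_{1\le i\le N}\max_{1\le t\le T}|z_{i,t,l}z_{i,t,k}| \ge e\Big) \le \sum_{l=1}^p\sum_{k=1}^p\sum_{i=1}^N\sum_{t=1}^T\frac{E\exp\big(\tfrac{C}{1+K}|z_{i,t,l}z_{i,t,k}|\big)}{\exp\big(\tfrac{C}{1+K}e\big)} \le 2p^2NT\,e^{-\frac{C}{1+K}e}.$$

Set $e = M\log(p^2NT)$ for some $M>0$ and note that the upper bound of the preceding probability becomes arbitrarily small for $N$ and $M$ sufficiently large. Thus,

$$\max_{1\le l\le p}\max_{1\le k\le p}\max_{1\le i\le N}\max_{1\le t\le T}|z_{i,t,l}z_{i,t,k}| = O_p\big(\log(p^2NT)\big)$$

and we get

$$\Big\|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}z_{i,t}'\big[\pi_{i,t}'(\hat\gamma-\gamma)\big]^2\Big\|_\infty = O_p\big(\log(p\vee N\vee T)\big)\frac{1}{NT}\big\|\Pi(\hat\gamma-\gamma)\big\|^2. \tag{7.41}$$

Combining (7.40) and (7.41), conclude

$$\big\|\hat\Sigma_{1,N}-\tilde\Sigma_{1,N}\big\|_\infty = O_p\bigg(\Big(\frac{(\log(p\vee T))^{7/4}}{N^{1/4}}\vee 1\Big)\frac{1}{\sqrt{NT}}\big\|\Pi(\hat\gamma-\gamma)\big\|\bigg) + O_p\big(\log(p\vee N\vee T)\big)\frac{1}{NT}\big\|\Pi(\hat\gamma-\gamma)\big\|^2.$$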
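Therefore, combining the preceding rates with (7.36) one gets

\begin{align*}
\big|\rho_1'\hat\Theta_Z\hat\Sigma_{1,N}\hat\Theta_Z'\rho_1 - \rho_1'\hat\Theta_Z\tilde\Sigma_{1,N}\hat\Theta_Z'\rho_1\big|
&= O_p\big(h_1G\lambda_{\mathrm{node}}^{-\vartheta}\big)\,O_p\bigg(\Big(\frac{(\log(p\vee T))^{7/4}}{N^{1/4}}\vee 1\Big)\frac{1}{\sqrt{NT}}\big\|\Pi(\hat\gamma-\gamma)\big\|\bigg) \\
&\quad + O_p\big(h_1G\lambda_{\mathrm{node}}^{-\vartheta}\big)\,O_p\big(\log(p\vee N\vee T)\big)\frac{1}{NT}\big\|\Pi(\hat\gamma-\gamma)\big\|^2 \\
&= o_p(1),
\end{align*}

where the last equality is due to Assumption 6(b)(i)-(ii), which establishes (7.33).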
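Several steps in this proof bound a maximum over many terms by a union bound combined with an exponential Markov inequality, and then choose a threshold of order $M\log(\cdot)$ so that the bound vanishes. The sketch below evaluates such a bound of the form $2p^2NT\,e^{-ce}$ at $e = M\log(p^2NT)$. The function name and the rate constant `c` are illustrative assumptions; `c` stands in for whatever constant the Orlicz-norm bound delivers.

```python
import math


def union_markov_bound(p, n_units, n_periods, m, c):
    """Evaluate 2 * p^2 * N * T * exp(-c * e) at the threshold e = m * log(p^2 * N * T).

    Algebraically this equals 2 * (p^2 * N * T) ** (1 - c * m), so the bound
    shrinks polynomially in the number of terms once c * m > 1.
    """
    n_terms = p * p * n_units * n_periods
    e = m * math.log(n_terms)
    return 2.0 * n_terms * math.exp(-c * e)


if __name__ == "__main__":
    for m in (1.0, 1.5, 2.0, 3.0):
        print(m, union_markov_bound(p=100, n_units=50, n_periods=10, m=m, c=1.0))
```

With `c * m = 1` the bound is the constant 2 and carries no information, which is why the argument takes $M$ sufficiently large.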

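Next, turn to (7.34). Note that

$$\big|\rho_1'\hat\Theta_Z\tilde\Sigma_{1,N}\hat\Theta_Z'\rho_1 - \rho_1'\hat\Theta_Z\Sigma_{1,N}\hat\Theta_Z'\rho_1\big| \le \big\|\tilde\Sigma_{1,N}-\Sigma_{1,N}\big\|_\infty\big\|\hat\Theta_Z'\rho_1\big\|_1^2.$$

Given (7.36), we only need to consider $\|\tilde\Sigma_{1,N}-\Sigma_{1,N}\|_\infty$. Use independence across $i$ (Assumption 1) and subgaussianity (Assumption 3) to invoke Proposition 3 in Appendix B such that

$$\big\|\tilde\Sigma_{1,N}-\Sigma_{1,N}\big\|_\infty = \max_{1\le l\le p}\max_{1\le k\le p}\Big|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\big(z_{i,t,l}z_{i,t,k}\varepsilon_{i,t}^2 - E[z_{i,t,l}z_{i,t,k}\varepsilon_{i,t}^2]\big)\Big| = O_p\bigg(\sqrt{\frac{(\log(p^2T))^5}{N}}\bigg). \tag{7.42}$$

Thus,

$$\big|\rho_1'\hat\Theta_Z\tilde\Sigma_{1,N}\hat\Theta_Z'\rho_1 - \rho_1'\hat\Theta_Z\Sigma_{1,N}\hat\Theta_Z'\rho_1\big| = O_p\bigg(\sqrt{\frac{(\log(p\vee T))^5}{N}}\,h_1G\lambda_{\mathrm{node}}^{-\vartheta}\bigg) = o_p(1),$$

where the last equality is due to Assumption 6(a)(i), establishing (7.34).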

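To prove (7.35), invoke Lemma 9 in Appendix B:

\begin{align*}
\big|\rho_1'\hat\Theta_Z\Sigma_{1,N}\hat\Theta_Z'\rho_1 - \rho_1'\Theta_Z\Sigma_{1,N}\Theta_Z'\rho_1\big|
&\le \|\Sigma_{1,N}\|_\infty\big\|(\hat\Theta_Z-\Theta_Z)'\rho_1\big\|_1^2 + 2\big\|\Sigma_{1,N}\Theta_Z'\rho_1\big\|\,\big\|(\hat\Theta_Z-\Theta_Z)'\rho_1\big\| \\
&\le \|\Sigma_{1,N}\|_\infty\big\|(\hat\Theta_Z-\Theta_Z)'\rho_1\big\|_1^2 + 2\,\mathrm{maxeval}(\Sigma_{1,N})\big\|\Theta_Z'\rho_1\big\|\,\big\|(\hat\Theta_Z-\Theta_Z)'\rho_1\big\|.
\end{align*}

First, note that $\|\Sigma_{1,N}\|_\infty$ is uniformly bounded as every entry is an average of uniformly bounded population moments (see Proposition 3 in Appendix B). Moreover,

$$\big\|(\hat\Theta_Z-\Theta_Z)'\rho_1\big\|_1 \le \sum_{j\in H_1}\big\|\hat\Theta_{Z,j}-\Theta_{Z,j}\big\|_1|\rho_{1j}| \le \max_{j\in H_1}\big\|\hat\Theta_{Z,j}-\Theta_{Z,j}\big\|_1\sqrt{h_1} = O_p\big(G\lambda_{\mathrm{node}}^{1-\vartheta}\sqrt{h_1}\big) = o_p(1), \tag{7.43}$$

where the first equality is due to (3.17), and the last equality is due to Assumption 6(a)(i). Next, $\|\Theta_Z'\rho_1\| \le \mathrm{maxeval}(\Theta_Z)\|\rho_1\| \le \mathrm{maxeval}(\Theta_Z) = 1/\mathrm{mineval}(\Psi_Z)$, which is uniformly bounded from above by Assumption 4(a). Furthermore,

$$\big\|(\hat\Theta_Z-\Theta_Z)'\rho_1\big\| = \Big\|\sum_{j\in H_1}(\hat\Theta_{Z,j}-\Theta_{Z,j})\rho_{1j}\Big\| \le \sum_{j\in H_1}\big\|\hat\Theta_{Z,j}-\Theta_{Z,j}\big\|\,|\rho_{1j}| \le \max_{j\in H_1}\big\|\hat\Theta_{Z,j}-\Theta_{Z,j}\big\|\sqrt{h_1} = O_p\big(\sqrt{G}\lambda_{\mathrm{node}}^{1-\vartheta/2}\sqrt{h_1}\big) = o_p(1),$$

where the second last equality is due to (3.18), and the last equality is due to (7.43). Thus, we have established (7.35), concluding the proof that (7.30) is $o_p(1)$.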
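(7.31) is $o_p(1)$:

Define $\tilde\Sigma_{2,N} := \frac{1}{T\sqrt{N}}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}d_{i,t}'\varepsilon_{i,t}^2$. It suffices to show

\begin{align*}
\big|\rho_1'\hat\Theta_Z\hat\Sigma_{2,N}\rho_2 - \rho_1'\hat\Theta_Z\tilde\Sigma_{2,N}\rho_2\big| &= o_p(1) \tag{7.44}\\
\big|\rho_1'\hat\Theta_Z\tilde\Sigma_{2,N}\rho_2 - \rho_1'\hat\Theta_Z\Sigma_{2,N}\rho_2\big| &= o_p(1) \tag{7.45}\\
\big|\rho_1'\hat\Theta_Z\Sigma_{2,N}\rho_2 - \rho_1'\Theta_Z\Sigma_{2,N}\rho_2\big| &= o_p(1). \tag{7.46}
\end{align*}

Consider (7.44) first. Note that

\begin{align*}
\big|\rho_1'\hat\Theta_Z\hat\Sigma_{2,N}\rho_2 - \rho_1'\hat\Theta_Z\tilde\Sigma_{2,N}\rho_2\big|
&\le \big\|\rho_1'\hat\Theta_Z(\hat\Sigma_{2,N}-\tilde\Sigma_{2,N})\big\|_\infty\|\rho_2\|_1 \\
&\le \big\|\rho_1'\hat\Theta_Z\big\|_1\big\|\hat\Sigma_{2,N}-\tilde\Sigma_{2,N}\big\|_\infty\|\rho_2\|_1
= O_p\Big(\sqrt{h_1h_2G\lambda_{\mathrm{node}}^{-\vartheta}}\Big)\big\|\hat\Sigma_{2,N}-\tilde\Sigma_{2,N}\big\|_\infty,
\end{align*}

where the last equality is due to (7.36). In addition,

\begin{align*}
\big\|\hat\Sigma_{2,N}-\tilde\Sigma_{2,N}\big\|_\infty
&= \Big\|\frac{1}{T\sqrt{N}}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}d_{i,t}'\big(\hat\varepsilon_{i,t}^2-\varepsilon_{i,t}^2\big)\Big\|_\infty \\
&\le 2\Big\|\frac{1}{T\sqrt{N}}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}d_{i,t}'\varepsilon_{i,t}\pi_{i,t}'(\hat\gamma-\gamma)\Big\|_\infty
+ \Big\|\frac{1}{T\sqrt{N}}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}d_{i,t}'\big[\pi_{i,t}'(\hat\gamma-\gamma)\big]^2\Big\|_\infty. \tag{7.47}
\end{align*}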


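Consider the first term of (7.47). A typical element of $\frac{1}{T\sqrt{N}}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}d_{i,t}'\varepsilon_{i,t}\pi_{i,t}'(\hat\gamma-\gamma)$ is

\begin{align*}
\frac{1}{T\sqrt{N}}\sum_{i=1}^N\sum_{t=1}^T z_{i,t,l}d_{i,t,k}\varepsilon_{i,t}\pi_{i,t}'(\hat\gamma-\gamma)
&\le \frac{1}{T\sqrt{N}}\Big(\sum_{i=1}^N\sum_{t=1}^T z_{i,t,l}^2d_{i,t,k}^2\varepsilon_{i,t}^2\Big)^{1/2}\Big(\sum_{i=1}^N\sum_{t=1}^T\big[\pi_{i,t}'(\hat\gamma-\gamma)\big]^2\Big)^{1/2} \\
&= \Big(\frac{1}{T}\sum_{t=1}^T z_{k,t,l}^2\varepsilon_{k,t}^2\Big)^{1/2}\frac{1}{\sqrt{NT}}\big\|\Pi(\hat\gamma-\gamma)\big\|
\end{align*}

for some $l\in\{1,\dots,p\}$ and $k\in\{1,\dots,N\}$, where the inequality is due to the Cauchy-Schwarz inequality. By subgaussianity, Assumption 3, we can use the same technique as in (8.3) in Proposition 3 in Appendix B to prove $E\exp\big(D\big|\frac{1}{T}\sum_{t=1}^T z_{k,t,l}^2\varepsilon_{k,t}^2\big|^{1/2}\big) \le BT$ for positive constants $D, B$. Using Markov's inequality, we have for $e>0$

$$P\Big(\max_{1\le l\le p}\max_{1\le k\le N}\frac{1}{T}\sum_{t=1}^T z_{k,t,l}^2\varepsilon_{k,t}^2 \ge e\Big) \le \sum_{l=1}^p\sum_{k=1}^N\frac{E\exp\big(D\big|\frac{1}{T}\sum_{t=1}^T z_{k,t,l}^2\varepsilon_{k,t}^2\big|^{1/2}\big)}{\exp(De^{1/2})} \le BpNT\,e^{-De^{1/2}}.$$

Set $e = M(\log(pNT))^2$ for some $M>0$ and note that the upper bound of the preceding probability becomes arbitrarily small for $N$ and $M$ sufficiently large. Thus, $\max_{1\le l\le p}\max_{1\le k\le N}\frac{1}{T}\sum_{t=1}^T z_{k,t,l}^2\varepsilon_{k,t}^2 = O_p\big((\log(pNT))^2\big)$. Therefore,

\begin{align*}
\Big\|\frac{1}{T\sqrt{N}}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}d_{i,t}'\varepsilon_{i,t}\pi_{i,t}'(\hat\gamma-\gamma)\Big\|_\infty
&\le \Big(\max_{1\le l\le p}\max_{1\le k\le N}\frac{1}{T}\sum_{t=1}^T z_{k,t,l}^2\varepsilon_{k,t}^2\Big)^{1/2}\frac{1}{\sqrt{NT}}\big\|\Pi(\hat\gamma-\gamma)\big\| \\
&= O_p\big(\log(pNT)\big)\frac{1}{\sqrt{NT}}\big\|\Pi(\hat\gamma-\gamma)\big\|. \tag{7.48}
\end{align*}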
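Now consider the second term of (7.47). A typical element of $\frac{1}{T\sqrt{N}}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}d_{i,t}'[\pi_{i,t}'(\hat\gamma-\gamma)]^2$ is

$$\frac{1}{T\sqrt{N}}\sum_{i=1}^N\sum_{t=1}^T z_{i,t,l}d_{i,t,k}\big[\pi_{i,t}'(\hat\gamma-\gamma)\big]^2 \le \max_{1\le t\le T}\sqrt{N}|z_{k,t,l}|\,\frac{1}{NT}\big\|\Pi(\hat\gamma-\gamma)\big\|^2$$

for some $l\in\{1,\dots,p\}$, $k\in\{1,\dots,N\}$. Using Markov's inequality, we have for any $e>0$

$$P\Big(\max_{1\le l\le p}\max_{1\le k\le N}\max_{1\le t\le T}|z_{k,t,l}| \ge e\Big) \le pNT\,\frac{K}{2}\,e^{-Ce^2}.$$

Set $e = \sqrt{M\log(pNT)}$ for some $M>0$ to see that the upper bound of the preceding probability becomes arbitrarily small for $N$ and $M$ sufficiently large. Thus, $\max_{1\le l\le p}\max_{1\le k\le N}\max_{1\le t\le T}|z_{k,t,l}| = O_p\big(\sqrt{\log(pNT)}\big)$. In total,

\begin{align*}
\Big\|\frac{1}{T\sqrt{N}}\sum_{i=1}^N\sum_{t=1}^T z_{i,t}d_{i,t}'\big[\pi_{i,t}'(\hat\gamma-\gamma)\big]^2\Big\|_\infty
&\le \max_{1\le l\le p}\max_{1\le k\le N}\max_{1\le t\le T}\sqrt{N}|z_{k,t,l}|\,\frac{1}{NT}\big\|\Pi(\hat\gamma-\gamma)\big\|^2 \\
&= O_p\Big(\sqrt{N\log(pNT)}\Big)\frac{1}{NT}\big\|\Pi(\hat\gamma-\gamma)\big\|^2. \tag{7.49}
\end{align*}

Therefore, combining (7.48) and (7.49),

\begin{align*}
\big|\rho_1'\hat\Theta_Z\hat\Sigma_{2,N}\rho_2 - \rho_1'\hat\Theta_Z\tilde\Sigma_{2,N}\rho_2\big|
&\le \big\|\hat\Sigma_{2,N}-\tilde\Sigma_{2,N}\big\|_\infty O_p\Big(\sqrt{h_1h_2G\lambda_{\mathrm{node}}^{-\vartheta}}\Big) \\
&= O_p\Big(\sqrt{h_1h_2G\lambda_{\mathrm{node}}^{-\vartheta}}\log(pNT)\Big)\frac{1}{\sqrt{NT}}\big\|\Pi(\hat\gamma-\gamma)\big\| + O_p\Big(\sqrt{h_1h_2G\lambda_{\mathrm{node}}^{-\vartheta}N\log(pNT)}\Big)\frac{1}{NT}\big\|\Pi(\hat\gamma-\gamma)\big\|^2 \\
&= o_p(1),
\end{align*}

where the last equality is due to Assumption 6(b)(iii)-(iv), which establishes (7.44).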

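Next, turn to (7.45). Note that

$$\big|\rho_1'\hat\Theta_Z\tilde\Sigma_{2,N}\rho_2 - \rho_1'\hat\Theta_Z\Sigma_{2,N}\rho_2\big| \le \big\|\tilde\Sigma_{2,N}-\Sigma_{2,N}\big\|_\infty\big\|\hat\Theta_Z'\rho_1\big\|_1\sqrt{h_2}.$$

Given (7.36), it suffices to consider

\begin{align*}
\big\|\tilde\Sigma_{2,N}-\Sigma_{2,N}\big\|_\infty
&= \max_{1\le l\le p}\max_{1\le k\le N}\Big|\frac{1}{T\sqrt{N}}\sum_{i=1}^N\sum_{t=1}^T\big(z_{i,t,l}d_{i,t,k}\varepsilon_{i,t}^2 - E[z_{i,t,l}d_{i,t,k}\varepsilon_{i,t}^2]\big)\Big| \\
&= \max_{1\le l\le p}\max_{1\le k\le N}\frac{1}{\sqrt{N}}\Big|\frac{1}{T}\sum_{t=1}^T\big(z_{k,t,l}\varepsilon_{k,t}^2 - E[z_{k,t,l}\varepsilon_{k,t}^2]\big)\Big|.
\end{align*}

By subgaussianity, Assumption 3, we can use the same technique as in (8.3) in Proposition 3 in Appendix B to prove $E\exp\big(D\big|\frac{1}{T}\sum_{t=1}^T(z_{k,t,l}\varepsilon_{k,t}^2 - E[z_{k,t,l}\varepsilon_{k,t}^2])\big|^{2/3}\big) \le BT$ for positive constants $D$ and $B$. Using Markov's inequality, we have for any $e>0$

\begin{align*}
P\Big(\max_{1\le l\le p}\max_{1\le k\le N}\Big|\frac{1}{T}\sum_{t=1}^T\big(z_{k,t,l}\varepsilon_{k,t}^2 - E[z_{k,t,l}\varepsilon_{k,t}^2]\big)\Big| \ge e\Big)
&\le \sum_{l=1}^p\sum_{k=1}^N P\Big(\Big|\frac{1}{T}\sum_{t=1}^T\big(z_{k,t,l}\varepsilon_{k,t}^2 - E[z_{k,t,l}\varepsilon_{k,t}^2]\big)\Big| \ge e\Big) \\
&\le \sum_{l=1}^p\sum_{k=1}^N\frac{E\exp\big(D\big|\frac{1}{T}\sum_{t=1}^T(z_{k,t,l}\varepsilon_{k,t}^2 - E[z_{k,t,l}\varepsilon_{k,t}^2])\big|^{2/3}\big)}{\exp(De^{2/3})} \\
&\le BpNT\,e^{-De^{2/3}}.
\end{align*}

Set $e = \sqrt{M(\log(pNT))^3}$ for some $M>0$ and note that the upper bound of the preceding probability becomes arbitrarily small for $N$ and $M$ sufficiently large. Thus,

$$\max_{1\le l\le p}\max_{1\le k\le N}\Big|\frac{1}{T}\sum_{t=1}^T\big(z_{k,t,l}\varepsilon_{k,t}^2 - E[z_{k,t,l}\varepsilon_{k,t}^2]\big)\Big| = O_p\Big(\sqrt{(\log(pNT))^3}\Big)$$

and so

$$\big\|\tilde\Sigma_{2,N}-\Sigma_{2,N}\big\|_\infty = \frac{1}{\sqrt{N}}\max_{1\le l\le p}\max_{1\le k\le N}\Big|\frac{1}{T}\sum_{t=1}^T\big(z_{k,t,l}\varepsilon_{k,t}^2 - E[z_{k,t,l}\varepsilon_{k,t}^2]\big)\Big| = O_p\bigg(\sqrt{\frac{(\log(pNT))^3}{N}}\bigg). \tag{7.50}$$

In total,

$$\big|\rho_1'\hat\Theta_Z\tilde\Sigma_{2,N}\rho_2 - \rho_1'\hat\Theta_Z\Sigma_{2,N}\rho_2\big| = O_p\bigg(\sqrt{\frac{(\log(p\vee N\vee T))^3h_1h_2G\lambda_{\mathrm{node}}^{-\vartheta}}{N}}\bigg) = o_p(1),$$

where the last equality is due to Assumption 6(a)(ii), establishing (7.45).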

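We now establish (7.46).

\begin{align*}
\big|\rho_1'\hat\Theta_Z\Sigma_{2,N}\rho_2 - \rho_1'\Theta_Z\Sigma_{2,N}\rho_2\big|
&\le \|\Sigma_{2,N}\|_\infty\big\|(\hat\Theta_Z-\Theta_Z)'\rho_1\big\|_1\sqrt{h_2}
= \|\Sigma_{2,N}\|_\infty O_p\big(G\lambda_{\mathrm{node}}^{1-\vartheta}\sqrt{h_1h_2}\big) \\
&= O\big(1/\sqrt{N}\big)\,O_p\big(G\lambda_{\mathrm{node}}^{1-\vartheta}\sqrt{h_1h_2}\big) = o_p(1),
\end{align*}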

where the first equality is due to (17.431) . the second equality is due to the definition of S 2 ,Ar and 
m, and the last equality is due to Assumption [6Ka)(ii) and (1)3). Thus, we have established 
(I7.46p . concluding the proof of that (|7.31l] is Op(l). 


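(7.32) is $o_p(1)$:

We now prove that (7.32) is $o_p(1)$. First,

$$\big|\rho_2'\hat\Sigma_{3,N}\rho_2 - \rho_2'\Sigma_{3,N}\rho_2\big| \le \big\|\hat\Sigma_{3,N}-\Sigma_{3,N}\big\|_\infty h_2 \le h_2\Big(\big\|\hat\Sigma_{3,N}-\tilde\Sigma_{3,N}\big\|_\infty + \big\|\tilde\Sigma_{3,N}-\Sigma_{3,N}\big\|_\infty\Big),$$

where $\tilde\Sigma_{3,N} := \frac{1}{T}\sum_{i=1}^N\sum_{t=1}^T d_{i,t}d_{i,t}'\varepsilon_{i,t}^2$. We consider $\|\hat\Sigma_{3,N}-\tilde\Sigma_{3,N}\|_\infty$ first.

\begin{align*}
\big\|\hat\Sigma_{3,N}-\tilde\Sigma_{3,N}\big\|_\infty
&= \Big\|\frac{1}{T}\sum_{i=1}^N\sum_{t=1}^T d_{i,t}d_{i,t}'\big(\hat\varepsilon_{i,t}^2-\varepsilon_{i,t}^2\big)\Big\|_\infty \\
&\le 2\Big\|\frac{1}{T}\sum_{i=1}^N\sum_{t=1}^T d_{i,t}d_{i,t}'\varepsilon_{i,t}\pi_{i,t}'(\hat\gamma-\gamma)\Big\|_\infty
+ \Big\|\frac{1}{T}\sum_{i=1}^N\sum_{t=1}^T d_{i,t}d_{i,t}'\big[\pi_{i,t}'(\hat\gamma-\gamma)\big]^2\Big\|_\infty. \tag{7.51}
\end{align*}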

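Consider the first term of (7.51). A typical element of $\frac{1}{T}\sum_{i=1}^N\sum_{t=1}^T d_{i,t}d_{i,t}'\varepsilon_{i,t}\pi_{i,t}'(\hat\gamma-\gamma)$ is

\begin{align*}
\frac{1}{T}\sum_{i=1}^N\sum_{t=1}^T d_{i,t,l}d_{i,t,k}\varepsilon_{i,t}\pi_{i,t}'(\hat\gamma-\gamma)
&\le \Big(\frac{1}{T}\sum_{i=1}^N\sum_{t=1}^T d_{i,t,l}^2d_{i,t,k}^2\varepsilon_{i,t}^2\Big)^{1/2}\Big(\frac{1}{T}\sum_{i=1}^N\sum_{t=1}^T\big[\pi_{i,t}'(\hat\gamma-\gamma)\big]^2\Big)^{1/2} \\
&\le \Big(\frac{1}{T}\sum_{t=1}^T\varepsilon_{k,t}^2\Big)^{1/2}\frac{1}{\sqrt{T}}\big\|\Pi(\hat\gamma-\gamma)\big\|
\end{align*}

for some $l,k\in\{1,\dots,N\}$, where the first inequality is due to the Cauchy-Schwarz inequality. By Assumption 3 we have $P(|\varepsilon_{i,t}^2| \ge e) \le P(|\varepsilon_{i,t}| \ge e^{1/2}) \le \frac{K}{2}e^{-Ce}$ for every $e>0$. It follows from Lemma 2.2.1 in van der Vaart and Wellner (1996) that $\|\varepsilon_{i,t}^2\|_{\psi_1} \le (1+K/2)/C$ for all $i$ and $t$.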
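Hence, by subadditivity of the Orlicz norm and Jensen's inequality, $\|\varepsilon_{i,t}^2 - E[\varepsilon_{i,t}^2]\|_{\psi_1} \le (2+K)/C$. Using the definition of the Orlicz norm, we have $E\exp\big(\frac{C}{2+K}|\varepsilon_{i,t}^2 - E[\varepsilon_{i,t}^2]|\big) \le 2$. Use independence of $\varepsilon_{i,t}$ across $t$ to invoke Proposition 2 in Appendix B for $D = \frac{C}{2+K}$ and $\alpha = 1/3$ to conclude

$$P\Big(\Big|\sum_{t=1}^T\big(\varepsilon_{k,t}^2 - E[\varepsilon_{k,t}^2]\big)\Big| \ge Te\Big) \le A\,e^{-B(e^2T)^{1/3}}$$

for positive constants $A$ and $B$. Setting $e = \sqrt{M(\log N)^3/T}$ for some $M>0$, one has

$$P\Big(\max_{1\le k\le N}\Big|\sum_{t=1}^T\big(\varepsilon_{k,t}^2 - E[\varepsilon_{k,t}^2]\big)\Big| \ge Te\Big) \le \sum_{k=1}^N P\Big(\Big|\sum_{t=1}^T\big(\varepsilon_{k,t}^2 - E[\varepsilon_{k,t}^2]\big)\Big| \ge Te\Big) \le AN\,e^{-BM^{1/3}\log N}.$$

The upper bound of the preceding probability becomes arbitrarily small for $N$ and $M$ sufficiently large. Hence,

$$\max_{1\le k\le N}\Big|\frac{1}{T}\sum_{t=1}^T\big(\varepsilon_{k,t}^2 - E[\varepsilon_{k,t}^2]\big)\Big| = O_p\bigg(\sqrt{\frac{(\log N)^3}{T}}\bigg). \tag{7.52}$$

Furthermore, since $\max_{1\le k\le N}\max_{1\le t\le T}E[\varepsilon_{k,t}^2] \le \max_{1\le k\le N}\max_{1\le t\le T}\|\varepsilon_{k,t}^2\|_{\psi_1} \le (1+K/2)/C = O(1)$,

$$\max_{1\le k\le N}\frac{1}{T}\sum_{t=1}^T\varepsilon_{k,t}^2 \le \max_{1\le k\le N}\Big|\frac{1}{T}\sum_{t=1}^T\big(\varepsilon_{k,t}^2 - E[\varepsilon_{k,t}^2]\big)\Big| + O(1). \tag{7.53}$$

Therefore,

$$\Big\|\frac{1}{T}\sum_{i=1}^N\sum_{t=1}^T d_{i,t}d_{i,t}'\varepsilon_{i,t}\pi_{i,t}'(\hat\gamma-\gamma)\Big\|_\infty = O_p\bigg(\Big(\frac{(\log N)^{3/4}}{T^{1/4}}\vee 1\Big)\frac{1}{\sqrt{T}}\big\|\Pi(\hat\gamma-\gamma)\big\|\bigg). \tag{7.54}$$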


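Now consider the second term of (7.51). A typical element of $\frac{1}{T}\sum_{i=1}^N\sum_{t=1}^T d_{i,t}d_{i,t}'[\pi_{i,t}'(\hat\gamma-\gamma)]^2$ is

$$\frac{1}{T}\sum_{i=1}^N\sum_{t=1}^T d_{i,t,l}d_{i,t,k}\big[\pi_{i,t}'(\hat\gamma-\gamma)\big]^2 \le \max_{i,t}|d_{i,t,l}d_{i,t,k}|\,\frac{1}{T}\big\|\Pi(\hat\gamma-\gamma)\big\|^2 = \frac{1}{T}\big\|\Pi(\hat\gamma-\gamma)\big\|^2, \tag{7.55}$$

uniformly over $l,k\in\{1,\dots,N\}$. Combining (7.54) and (7.55), we have

$$\big\|\hat\Sigma_{3,N}-\tilde\Sigma_{3,N}\big\|_\infty = O_p\bigg(\Big(\frac{(\log N)^{3/4}}{T^{1/4}}\vee 1\Big)\frac{1}{\sqrt{T}}\big\|\Pi(\hat\gamma-\gamma)\big\| + \frac{1}{T}\big\|\Pi(\hat\gamma-\gamma)\big\|^2\bigg). \tag{7.56}$$

Next, consider

\begin{align*}
\big\|\tilde\Sigma_{3,N}-\Sigma_{3,N}\big\|_\infty
&= \max_{1\le l\le N}\max_{1\le k\le N}\Big|\frac{1}{T}\sum_{i=1}^N\sum_{t=1}^T\big(d_{i,t,l}d_{i,t,k}\varepsilon_{i,t}^2 - E[d_{i,t,l}d_{i,t,k}\varepsilon_{i,t}^2]\big)\Big| \\
&= \max_{1\le k\le N}\Big|\frac{1}{T}\sum_{t=1}^T\big(\varepsilon_{k,t}^2 - E[\varepsilon_{k,t}^2]\big)\Big| = O_p\bigg(\sqrt{\frac{(\log N)^3}{T}}\bigg), \tag{7.57}
\end{align*}

where the last equality is due to (7.52). Summing up (7.56) and (7.57) yields

\begin{align*}
\big|\rho_2'\hat\Sigma_{3,N}\rho_2 - \rho_2'\Sigma_{3,N}\rho_2\big|
&= h_2\,O_p\bigg(\Big(\frac{(\log N)^{3/4}}{T^{1/4}}\vee 1\Big)\frac{1}{\sqrt{T}}\big\|\Pi(\hat\gamma-\gamma)\big\| + \frac{1}{T}\big\|\Pi(\hat\gamma-\gamma)\big\|^2\bigg) + O_p\bigg(h_2\sqrt{\frac{(\log N)^3}{T}}\bigg) \\
&= o_p(1),
\end{align*}

where the last equality is due to Assumption 6(b)(v), which, in turn, implies that (7.32) is $o_p(1)$.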

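Thus, we have proved (7.29). (3.21) then follows trivially since the conclusions of Theorem 1 and Corollary 1 are uniform over the set $\mathcal{F}(s_1,\nu,E)$ and the true parameter vector only entered the above arguments when these results were used.

7.4.2 Numerators of $t_1$ and $t_1'$

We now show that the numerators of $t_1$ and $t_1'$ are asymptotically equivalent, i.e.,

$$\big|\rho'\hat\Theta S^{-1}\Pi'\varepsilon - \rho'\Theta S^{-1}\Pi'\varepsilon\big| = o_p(1). \tag{7.58}$$

Note that

\begin{align*}
\big|\rho'\hat\Theta S^{-1}\Pi'\varepsilon - \rho'\Theta S^{-1}\Pi'\varepsilon\big|
&\le \big\|\rho'(\hat\Theta-\Theta)\big\|_1\big\|S^{-1}\Pi'\varepsilon\big\|_\infty = \big\|\rho_1'(\hat\Theta_Z-\Theta_Z)\big\|_1\big\|S^{-1}\Pi'\varepsilon\big\|_\infty \\
&= O_p\big(G\lambda_{\mathrm{node}}^{1-\vartheta}\sqrt{h_1}\big)\Big(\Big\|\frac{1}{\sqrt{NT}}Z'\varepsilon\Big\|_\infty\vee\Big\|\frac{1}{\sqrt{T}}D'\varepsilon\Big\|_\infty\Big) \\
&= O_p\big(G\lambda_{\mathrm{node}}^{1-\vartheta}\sqrt{h_1}\big)\,O_p\big((\log(p\vee N))^{3/2}\big) = o_p(1),
\end{align*}

where the second equality is due to (7.43), the third equality is due to (7.12) and (7.13), and the last equality is due to Assumption 6(a)(iii).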

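7.4.3 $t_1' \xrightarrow{d} N(0,1)$

We now prove that $t_1'$ is asymptotically distributed as a standard normal by verifying (i)-(iii) of Theorem 5 in Appendix B. Note that

$$t_1' = \frac{\rho'\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho}} = \sum_{j=1}^{\kappa}\frac{\rho'\Theta S^{-1}\pi_j\varepsilon_j}{\sqrt{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho}},$$

where $\kappa := NT$. In the proof of Lemma 8 we have shown that $\{\rho'\Theta S^{-1}\pi_j\varepsilon_j\}$ is a martingale difference array with variance

$$\mathrm{var}(t_1') = E[t_1'^2] = \frac{\rho'\Theta S^{-1}E[\Pi'\varepsilon\varepsilon'\Pi]S^{-1}\Theta'\rho}{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho} = 1,$$

where we have used the definition of $\Sigma_{\Pi\varepsilon}$. We have already shown in (7.28) that the denominator of $t_1'$ is uniformly bounded away from zero. Thus, verifying that $t_1'$ satisfies (i) and (ii) of Theorem 5 in Appendix B is equivalent to verifying that the numerator of $t_1'$ satisfies (i) and (ii) of Theorem 5. First, note that

$$\big\|\rho_1'\Theta_Z\big\|_1 \le \sum_{j\in H_1}|\rho_{1j}|\,\|\Theta_{Z,j}\|_1 \le \sqrt{h_1}\max_{j\in H_1}\|\Theta_{Z,j}\|_1 = O\Big(\sqrt{h_1G\lambda_{\mathrm{node}}^{-\vartheta}}\Big), \tag{7.59}$$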
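where the last equality is due to (7.27). Next,

\begin{align*}
\big|\rho'\Theta S^{-1}\pi_{i,t}\varepsilon_{i,t}\big|
&\le \Big|\rho_1'\Theta_Z\frac{z_{i,t}\varepsilon_{i,t}}{\sqrt{NT}}\Big| + \Big|\rho_2'\frac{d_{i,t}\varepsilon_{i,t}}{\sqrt{T}}\Big|
\le \big\|\rho_1'\Theta_Z\big\|_1\max_{1\le l\le p}\frac{|z_{i,t,l}\varepsilon_{i,t}|}{\sqrt{NT}} + \frac{\|\rho_2\|_\infty|\varepsilon_{i,t}|}{\sqrt{T}} \\
&\lesssim \sqrt{h_1G\lambda_{\mathrm{node}}^{-\vartheta}}\max_{1\le l\le p}\frac{|z_{i,t,l}\varepsilon_{i,t}|}{\sqrt{NT}} + \frac{\|\rho_2\|_\infty|\varepsilon_{i,t}|}{\sqrt{T}},
\end{align*}

where the last inequality is due to (7.59). We have already shown in the proof of Lemma 8 that $z_{i,t,l}\varepsilon_{i,t}$ has uniformly bounded $\psi_1$-Orlicz norm. The same is the case for $\varepsilon_{i,t}$. Hence,

\begin{align*}
\bigg\|\sqrt{h_1G\lambda_{\mathrm{node}}^{-\vartheta}}\max_{1\le l\le p}\frac{|z_{i,t,l}\varepsilon_{i,t}|}{\sqrt{NT}} + \frac{\|\rho_2\|_\infty|\varepsilon_{i,t}|}{\sqrt{T}}\bigg\|_{\psi_1}
&\le \sqrt{\frac{h_1G\lambda_{\mathrm{node}}^{-\vartheta}}{NT}}\Big\|\max_{1\le l\le p}|z_{i,t,l}\varepsilon_{i,t}|\Big\|_{\psi_1} + \frac{\|\rho_2\|_\infty}{\sqrt{T}}\|\varepsilon_{i,t}\|_{\psi_1} \\
&\lesssim \sqrt{\frac{h_1G\lambda_{\mathrm{node}}^{-\vartheta}}{NT}}\log(1+p)\max_{1\le l\le p}\|z_{i,t,l}\varepsilon_{i,t}\|_{\psi_1} + \frac{\|\rho_2\|_\infty}{\sqrt{T}} \\
&\lesssim \sqrt{\frac{h_1G\lambda_{\mathrm{node}}^{-\vartheta}}{NT}}\log(1+p) + \frac{\|\rho_2\|_\infty}{\sqrt{T}}
\end{align*}

for all $i$ and $t$, where the first rate inequality is due to Lemma 2.2.2 in van der Vaart and Wellner (1996). Using Lemma 2.2.2 in van der Vaart and Wellner (1996) one more time,

$$\Big\|\max_{1\le i\le N}\max_{1\le t\le T}\big|\rho'\Theta S^{-1}\pi_{i,t}\varepsilon_{i,t}\big|\Big\|_{\psi_1} \lesssim \log(1+NT)\bigg[\sqrt{\frac{h_1G\lambda_{\mathrm{node}}^{-\vartheta}}{NT}}\log(1+p) + \frac{\|\rho_2\|_\infty}{\sqrt{T}}\bigg] = o(1),$$

where the last equality is due to Assumption 6(a)(iv)-(v). Since $\|U\|_{L_r} \le r!\,\|U\|_{\psi_1}$ for any random variable $U$ (van der Vaart and Wellner (1996), p. 95), we conclude that (i) and (ii) of Theorem 5 are satisfied.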
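We now verify (iii) of Theorem 5. That is,

$$\frac{\sum_{j=1}^{\kappa}\big(\rho'\Theta S^{-1}\pi_j\varepsilon_j\big)^2}{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho} = \frac{\rho'\Theta\begin{pmatrix}\tilde\Sigma_{1,N} & \tilde\Sigma_{2,N}\\ \tilde\Sigma_{2,N}' & \tilde\Sigma_{3,N}\end{pmatrix}\Theta'\rho}{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho} \xrightarrow{p} 1.$$

Since we have already shown in (7.28) that the denominator of $t_1'$ is uniformly bounded away from zero, it suffices to show

$$\rho'\Theta\begin{pmatrix}\tilde\Sigma_{1,N} & \tilde\Sigma_{2,N}\\ \tilde\Sigma_{2,N}' & \tilde\Sigma_{3,N}\end{pmatrix}\Theta'\rho - \rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho = o_p(1). \tag{7.60}$$

The left-hand side of (7.60) can be bounded by

\begin{align*}
\bigg|\rho'\Theta\begin{pmatrix}\tilde\Sigma_{1,N} & \tilde\Sigma_{2,N}\\ \tilde\Sigma_{2,N}' & \tilde\Sigma_{3,N}\end{pmatrix}\Theta'\rho - \rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho\bigg|
&\le \big|\rho_1'\Theta_Z\tilde\Sigma_{1,N}\Theta_Z'\rho_1 - \rho_1'\Theta_Z\Sigma_{1,N}\Theta_Z'\rho_1\big| \tag{7.61}\\
&\quad + 2\big|\rho_1'\Theta_Z\tilde\Sigma_{2,N}\rho_2 - \rho_1'\Theta_Z\Sigma_{2,N}\rho_2\big| \tag{7.62}\\
&\quad + \big|\rho_2'\tilde\Sigma_{3,N}\rho_2 - \rho_2'\Sigma_{3,N}\rho_2\big|. \tag{7.63}
\end{align*}

Thus, we establish that (7.61), (7.62) and (7.63) are $o_p(1)$. Consider (7.61) first.

$$\big|\rho_1'\Theta_Z\tilde\Sigma_{1,N}\Theta_Z'\rho_1 - \rho_1'\Theta_Z\Sigma_{1,N}\Theta_Z'\rho_1\big| \le \big\|\tilde\Sigma_{1,N}-\Sigma_{1,N}\big\|_\infty\big\|\Theta_Z'\rho_1\big\|_1^2 = O_p\bigg(\sqrt{\frac{(\log(p^2T))^5}{N}}\bigg)O\big(h_1G\lambda_{\mathrm{node}}^{-\vartheta}\big) = o_p(1),$$

where the first equality is due to (7.59) and (7.42), and the last equality is due to Assumption 6(a)(i). Now consider (7.62).

$$\big|\rho_1'\Theta_Z\tilde\Sigma_{2,N}\rho_2 - \rho_1'\Theta_Z\Sigma_{2,N}\rho_2\big| \le \big\|\tilde\Sigma_{2,N}-\Sigma_{2,N}\big\|_\infty\big\|\Theta_Z'\rho_1\big\|_1\|\rho_2\|_1 = O_p\bigg(\sqrt{\frac{(\log(pNT))^3h_1h_2G\lambda_{\mathrm{node}}^{-\vartheta}}{N}}\bigg) = o_p(1),$$

where the first equality is due to (7.50), and the last equality is due to Assumption 6(a)(ii). Finally, consider (7.63).

$$\big|\rho_2'\tilde\Sigma_{3,N}\rho_2 - \rho_2'\Sigma_{3,N}\rho_2\big| \le \big\|\tilde\Sigma_{3,N}-\Sigma_{3,N}\big\|_\infty\|\rho_2\|_1^2 = O_p\bigg(\sqrt{\frac{(\log N)^3}{T}}\bigg)O(h_2) = o_p(1),$$

where the first equality is due to (7.57), and the last equality is due to Assumption 6(b)(v). Therefore, we have established (7.60) and $t_1'$ is asymptotically standard Gaussian.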



7.4.4 $t_2 = o_p(1)$

Last, we prove that $t_2 = o_p(1)$. Since the denominator of $t_2$ is bounded away from zero by a positive constant with probability approaching one by (7.28) and (7.29), it suffices to show $\rho'\Delta = o_p(1)$.

\p'^\ = \'^Pj^j < Vhin&^lAjl < v/L||5(7-7)||imc^||0'T7v-lj,+^^^ 
j&H 


= \/L||S'( 7 - 7 )||i ( 


= \/L||5(7-7 )||i ("max ( 


V max 
i&H2 




max 

\j&Hi 


1 


NT 


Z' ZQz,j ~ Cj 


V 

^^D'ZSzi 

1 V max 

^^Z'D 

oo 

tVn 

ooj 

Ty/N 


< 


V^I|5'(7 - 7)l|i (^max ^ -^Z'ZSz,] - ej ^ v||0z7||^ 


Ty/N 


D'Z 


V max 
i£H2 


tVn 


Z'D 


where $\hat\Theta_j$ is the $j$th row of $\hat\Theta$ but written as a $(p+N)\times 1$ vector, and $I_{p+N,j}$ is the $j$th row of $I_{p+N}$ but written as a $(p+N)\times 1$ vector. Note that

$$\max_{j\in H_1}\Big\|\frac{1}{NT}Z'Z\hat\Theta_{Z,j} - e_j\Big\|_\infty \le \max_{j\in H_1}\frac{\lambda_{\mathrm{node}}}{\hat\tau_j^2} = O_p(\lambda_{\mathrm{node}}),$$

where the inequality is due to the extended KKT conditions, and the equality is due to (3.15). Recall that by (7.15) we have that for every $e>0$


P( 


max max 


l<i<Ar l</<p y/NT ^ 


1 




N p 


i=l 1=1 


1 


ViVT^ 




> e) < ApNe 


-bAn 


for positive constants A,B. Setting e = > 0) makes the upper bound of the 

preceding inequality arbitrarily small for sufficiently large N and M, such that 




Ty/N 


D'Z 


= 0, 




Thus, $|\rho'\Delta| = o_p(1)$ by Assumption 6(c). For later reference,

$$\sup_{\gamma\in\mathcal{F}(s_1,\nu,E)}|\rho'\Delta| = o_p(1) \tag{7.64}$$

by the same reasoning leading to the uniform validity of (3.21).

□
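7.5 Proof of Theorem 3

Proof of Theorem 3. For every $e>0$, define

\begin{align*}
A_{1,N} &:= \Big\{\sup_{\gamma\in\mathcal{F}(s_1,\nu,E)}|\rho'\Delta| < e\Big\}, \\
A_{2,N} &:= \bigg\{\sup_{\gamma\in\mathcal{F}(s_1,\nu,E)}\bigg|\frac{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}}{\sqrt{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho}} - 1\bigg| < e\bigg\}, \\
A_{3,N} &:= \big\{\big|\rho'\hat\Theta S^{-1}\Pi'\varepsilon - \rho'\Theta S^{-1}\Pi'\varepsilon\big| < e\big\}.
\end{align*}

By (7.64), (3.21), (7.28) and (7.58), the probabilities of the preceding three events all tend to one. Thus, for every $t\in\mathbb{R}$,

\begin{align*}
\bigg|P\bigg(\frac{\rho'S(\tilde\gamma-\gamma)}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \le t\bigg) - \Phi(t)\bigg|
&= \bigg|P\bigg(\frac{\rho'\hat\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} - \frac{\rho'\Delta}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \le t\bigg) - \Phi(t)\bigg| \\
&\le \bigg|P\bigg(\frac{\rho'\hat\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} - \frac{\rho'\Delta}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \le t, A_{1,N}, A_{2,N}, A_{3,N}\bigg) - \Phi(t)\bigg| + P\big(\cup_{i=1}^3A_{i,N}^c\big).
\end{align*}

We consider the probability on the event $A_{1,N}\cap A_{2,N}\cap A_{3,N}$ first.

\begin{align*}
P\bigg(\frac{\rho'\hat\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} - \frac{\rho'\Delta}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \le t, A_{1,N}, A_{2,N}, A_{3,N}\bigg)
&\le P\bigg(\frac{\rho'\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho}} \le t(1+e) + \frac{e+e}{\sqrt{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho}}\bigg) \\
&\le P\bigg(\frac{\rho'\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho}} \le t(1+e) + 2De\bigg)
\end{align*}

for some positive constant $D$, where the first and second inequalities are due to the fact that $\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho$ is uniformly bounded away from zero, see (7.28). Since the last inequality in the above does not depend on $\gamma$,

$$\sup_{\gamma\in\mathcal{F}(s_1,\nu,E)}P\bigg(\frac{\rho'\hat\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} - \frac{\rho'\Delta}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \le t, A_{1,N}, A_{2,N}, A_{3,N}\bigg) \le P\bigg(\frac{\rho'\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho}} \le t(1+e) + 2De\bigg).$$

By the asymptotic normality of $t_1'$, for $N$ sufficiently large,

$$\sup_{\gamma\in\mathcal{F}(s_1,\nu,E)}P\bigg(\frac{\rho'\hat\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} - \frac{\rho'\Delta}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \le t, A_{1,N}, A_{2,N}, A_{3,N}\bigg) \le \Phi\big(t(1+e) + 2De\big) + e.$$

As the above arguments are valid for every $e>0$, we can use the continuity of $q\mapsto\Phi(q)$ to conclude that for every $\delta>0$, one can choose $e$ sufficiently small such that

$$\sup_{\gamma\in\mathcal{F}(s_1,\nu,E)}P\bigg(\frac{\rho'\hat\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} - \frac{\rho'\Delta}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \le t, A_{1,N}, A_{2,N}, A_{3,N}\bigg) \le \Phi(t) + \delta + e. \tag{7.65}$$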
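We next find a lower bound for the same probability.

\begin{align*}
&P\bigg(\frac{\rho'\hat\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} - \frac{\rho'\Delta}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \le t, A_{1,N}, A_{2,N}, A_{3,N}\bigg) \\
&\quad\ge P\bigg(\frac{\rho'\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho}} \le t(1-e) - \frac{e+e}{\sqrt{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho}}, A_{1,N}, A_{2,N}, A_{3,N}\bigg) \\
&\quad\ge P\bigg(\frac{\rho'\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho}} \le t(1-e) - 2De, A_{1,N}, A_{2,N}, A_{3,N}\bigg) \\
&\quad\ge P\bigg(\frac{\rho'\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho}} \le t(1-e) - 2De\bigg) + P\big(\cap_{i=1}^3A_{i,N}\big) - 1
\end{align*}

for some positive constant $D$, where the first and second inequalities are due to the fact that $\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho$ is uniformly bounded away from zero, see (7.28). Since the last inequality in the above display does not depend on $\gamma$, and $P(\cap_{i=1}^3A_{i,N})$ can be made arbitrarily close to one for sufficiently large $N$,

$$\inf_{\gamma\in\mathcal{F}(s_1,\nu,E)}P\bigg(\frac{\rho'\hat\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} - \frac{\rho'\Delta}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \le t, A_{1,N}, A_{2,N}, A_{3,N}\bigg) \ge P\bigg(\frac{\rho'\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\Theta\Sigma_{\Pi\varepsilon}\Theta'\rho}} \le t(1-e) - 2De\bigg) - e.$$

By the asymptotic normality of $t_1'$, for $N$ sufficiently large,

$$\inf_{\gamma\in\mathcal{F}(s_1,\nu,E)}P\bigg(\frac{\rho'\hat\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} - \frac{\rho'\Delta}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \le t, A_{1,N}, A_{2,N}, A_{3,N}\bigg) \ge \Phi\big(t(1-e) - 2De\big) - 2e.$$

As the above arguments are valid for every $e>0$, we can use the continuity of $q\mapsto\Phi(q)$ to conclude that for every $\delta>0$, one can choose $e$ sufficiently small such that

$$\inf_{\gamma\in\mathcal{F}(s_1,\nu,E)}P\bigg(\frac{\rho'\hat\Theta S^{-1}\Pi'\varepsilon}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} - \frac{\rho'\Delta}{\sqrt{\rho'\hat\Theta\hat\Sigma_{\Pi\varepsilon}\hat\Theta'\rho}} \le t, A_{1,N}, A_{2,N}, A_{3,N}\bigg) \ge \Phi(t) - \delta - 2e. \tag{7.66}$$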


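Thus, by (7.65), (7.66) and the fact that $\sup_{\gamma\in\mathcal{F}(s_1,\nu,E)}P(\cup_{i=1}^3A_{i,N}^c) = o(1)$, we have proved (4.1) (the uniformity over $t\in\mathbb{R}$ follows from the fact that $\Phi(t)$ is continuous). To see (4.2), note that

\begin{align*}
P\bigg(\alpha_j\notin\Big[\tilde\alpha_j - z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}},\ \tilde\alpha_j + z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}}\Big]\bigg)
&= P\bigg(\bigg|\frac{\sqrt{NT}(\tilde\alpha_j-\alpha_j)}{\hat\sigma_{\alpha,j}}\bigg| > z_{1-\delta/2}\bigg) \\
&\le 1 - P\bigg(\frac{\sqrt{NT}(\tilde\alpha_j-\alpha_j)}{\hat\sigma_{\alpha,j}} \le z_{1-\delta/2}\bigg) + P\bigg(\frac{\sqrt{NT}(\tilde\alpha_j-\alpha_j)}{\hat\sigma_{\alpha,j}} \le -z_{1-\delta/2}\bigg).
\end{align*}

Thus, taking the supremum over $\gamma\in\mathcal{F}(s_1,\nu,E)$ and letting $N$ tend to infinity yields (4.2) via (4.1). The proof is the same for (4.3). Next, we turn to (4.4).

\begin{align*}
\sqrt{NT}\sup_{\gamma\in\mathcal{F}(s_1,\nu,E)}\mathrm{diam}\bigg(\Big[\tilde\alpha_j - z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}},\ \tilde\alpha_j + z_{1-\delta/2}\frac{\hat\sigma_{\alpha,j}}{\sqrt{NT}}\Big]\bigg)
&= 2z_{1-\delta/2}\Big(\sqrt{[\Theta_Z\Sigma_{1,N}\Theta_Z']_{jj}} + o_p(1)\Big) \\
&\le 2z_{1-\delta/2}\bigg(\frac{\sqrt{\mathrm{maxeval}(\Sigma_{1,N})}}{\mathrm{mineval}(\Psi_Z)} + o_p(1)\bigg) = O_p(1),
\end{align*}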
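where the first equality is due to (3.21), and the last equality is due to Assumptions 4(a) and 6(d). Similarly, we can prove (4.5):

\begin{align*}
\sqrt{T}\sup_{\gamma\in\mathcal{F}(s_1,\nu,E)}\mathrm{diam}\bigg(\Big[\tilde\eta_i - z_{1-\delta/2}\frac{\hat\sigma_{\eta,i}}{\sqrt{T}},\ \tilde\eta_i + z_{1-\delta/2}\frac{\hat\sigma_{\eta,i}}{\sqrt{T}}\Big]\bigg)
&= 2z_{1-\delta/2}\Big(\sqrt{[\Sigma_{3,N}]_{ii}} + o_p(1)\Big) \\
&= 2z_{1-\delta/2}\bigg(\sqrt{\frac{1}{T}\sum_{t=1}^TE[\varepsilon_{i,t}^2]} + o_p(1)\bigg) = O_p(1),
\end{align*}

where the third equality follows from the arguments above (7.53).

□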
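The intervals whose coverage and diameter are analysed above have the familiar form "estimate plus or minus a normal quantile times a standard error". The following is a minimal sketch of how such an interval could be computed; the function name and arguments are illustrative assumptions, and the inputs (a desparsified estimate and an estimated standard deviation) are taken as given rather than computed here. Passing `n_obs = N * T` corresponds to a coefficient $\alpha_j$ and `n_obs = T` to a fixed effect $\eta_i$.

```python
import math
from statistics import NormalDist


def confidence_interval(estimate, sigma_hat, n_obs, delta=0.05):
    """Return (lower, upper) = estimate -/+ z_{1-delta/2} * sigma_hat / sqrt(n_obs)."""
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    half_width = z * sigma_hat / math.sqrt(n_obs)
    return estimate - half_width, estimate + half_width


if __name__ == "__main__":
    n_units, n_periods = 100, 25
    # Interval for a slope coefficient: contracts at rate 1 / sqrt(N * T).
    print(confidence_interval(0.3, 2.0, n_units * n_periods))
    # Interval for a fixed effect: contracts only at rate 1 / sqrt(T).
    print(confidence_interval(0.3, 2.0, n_periods))
```

The two calls illustrate the different contraction rates for the two parts of the parameter vector: with the same standard deviation, the fixed-effect interval is wider by a factor of $\sqrt{N}$.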


8 Appendix B 

Proposition 1. Let $A$ and $B$ be two positive semidefinite $(p-1)\times(p-1)$ matrices and $\delta := \max_{1\le i,k\le p-1}|A_{ik}-B_{ik}|$. For any integer $r\in\{1,\dots,p-1\}$, one has

$$\kappa^2(B,r) \ge \kappa^2(A,r) - 16r\delta.$$


Proof. The proof is exactly the same as that of Lemma El 


□ 


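Theorem 4 (Fan et al. (2012)). Let $\alpha\in(0,1)$. Assume that $(X_i,\mathcal{F}_i)_{i\ge 1}$ is a sequence of supermartingale differences satisfying $\sup_iE\big[\exp\big(|X_i|^{\frac{2\alpha}{1-\alpha}}\big)\big] \le C_1$ for some constant $C_1\in(0,\infty)$. Define $S_k := \sum_{i=1}^kX_i$. Then, for all $e>0$,

$$P\Big(\max_{1\le k\le n}S_k \ge ne\Big) \le C(\alpha,n,e)\exp\Big(-\Big(\frac{e}{4}\Big)^{2\alpha}n^{\alpha}\Big),$$

where

$$C(\alpha,n,e) := 2 + 35C_1\bigg[\frac{1}{16^{1-\alpha}(ne^2)^{\alpha}} + \frac{1}{ne^2}\Big(\frac{3(1-\alpha)}{2\alpha}\Big)^{\frac{1-\alpha}{\alpha}}\bigg].$$

The preceding theorem is not exactly the same as Theorem 2.1 in Fan et al. (2012), but taken from the proof of Theorem 2.1 in Fan et al. (2012). This theorem generalises Theorem 3.2 in Lesigne and Volný (2001).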

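Proposition 2. Let $\alpha\in(0,1)$. Assume that $(X_i,\mathcal{F}_i)_{i=1}^n$ is a sequence of martingale differences satisfying $\sup_iE\big[\exp\big(D|X_i|^{\frac{2\alpha}{1-\alpha}}\big)\big] \le C_1$ for some positive constant $D$. ($C_1$ could change with the sample size $n$.) Then, for all $e \ge 1/\sqrt{n}$,

$$P\Big(\Big|\sum_{i=1}^nX_i\Big| \ge ne\Big) \le AC_1e^{-K(e^2n)^{\alpha}}$$

for positive constants $A$ and $K$.

Proof. This proposition is a simple adaptation of the preceding theorem. Note that for some positive constant $D$,

$$P\Big(\sum_{i=1}^nX_i \ge ne\Big) = P\Big(\sum_{i=1}^nD^{\frac{1-\alpha}{2\alpha}}X_i \ge nD^{\frac{1-\alpha}{2\alpha}}e\Big) = P\Big(\sum_{i=1}^nY_i \ge n\delta\Big),$$

where $Y_i := D^{\frac{1-\alpha}{2\alpha}}X_i$ and $\delta := D^{\frac{1-\alpha}{2\alpha}}e$. Now $(Y_i,\mathcal{F}_i)_{i=1}^n$ is a sequence of martingale differences satisfying $\sup_iE\big[\exp\big(|Y_i|^{\frac{2\alpha}{1-\alpha}}\big)\big] \le C_1$. Invoking the preceding theorem, we have

$$P\Big(\sum_{i=1}^nY_i \ge n\delta\Big) \le C(\alpha,n,\delta)\,e^{-(\delta/4)^{2\alpha}n^{\alpha}}.$$

$(-Y_i,\mathcal{F}_i)_{i=1}^n$ is also a sequence of martingale differences satisfying the same exponential moment condition. Thus,

$$P\Big(\Big|\sum_{i=1}^nX_i\Big| \ge ne\Big) = P\Big(\Big|\sum_{i=1}^nY_i\Big| \ge n\delta\Big) \le 2C(\alpha,n,\delta)\,e^{-(\delta/4)^{2\alpha}n^{\alpha}} \le AC_1e^{-K(e^2n)^{\alpha}}$$

for positive constants $A, K$, where the last inequality used that if $e \ge 1/\sqrt{n}$ then $2C\big(\alpha,n,D^{\frac{1-\alpha}{2\alpha}}e\big) \le AC_1$ for some positive constant $A$. □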

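Proposition 3. Suppose we have random variables $Z_{l,i,t,j}$ uniformly subgaussian for $l = 1,\dots,L$ ($L\ge 3$ fixed), $i = 1,\dots,N$, $t = 1,\dots,T$ and $j = 1,\dots,p$. Both $p$ and $T$ increase with $N$ (functions of $N$). $Z_{l_1,i_1,t_1,j_1}$ and $Z_{l_2,i_2,t_2,j_2}$ are independent as long as $i_1\ne i_2$ regardless of the values of other subscripts. Then,

$$\max_{1\le j\le p}\max_{1\le t\le T}\max_{1\le i\le N}E\Big|\prod_{l=1}^LZ_{l,i,t,j}\Big| \le A = O(1), \tag{8.1}$$

for some positive constant $A$ and

$$\max_{1\le j\le p}\bigg|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\Big(\prod_{l=1}^LZ_{l,i,t,j} - E\Big[\prod_{l=1}^LZ_{l,i,t,j}\Big]\Big)\bigg| = O_p\bigg(\sqrt{\frac{(\log(pT))^{L+1}}{N}}\bigg). \tag{8.2}$$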


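Proof. For every $e\ge 0$, $P\big(\big|\prod_{l=1}^LZ_{l,i,t,j}\big| \ge e\big) \le \sum_{l=1}^LP\big(|Z_{l,i,t,j}| \ge e^{1/L}\big) \le Ke^{-Ce^{2/L}}$ for positive constants $K, C$. Next, using Hölder's inequality, we have

$$\max_{1\le j\le p}\max_{1\le t\le T}\max_{1\le i\le N}E\Big|\prod_{l=1}^LZ_{l,i,t,j}\Big| \le \max_{1\le j\le p}\max_{1\le t\le T}\max_{1\le i\le N}\prod_{l=1}^L\big(E|Z_{l,i,t,j}|^L\big)^{1/L}.$$

Uniform subgaussianity implies that $(E|Z_{l,i,t,j}|^L)^{1/L}$ is uniformly bounded. That is, $(E|Z_{l,i,t,j}|^L)^{1/L} \le L!\,\|Z_{l,i,t,j}\|_{\psi_1} \le L!(\log 2)^{-1/2}\|Z_{l,i,t,j}\|_{\psi_2} \le L!(\log 2)^{-1/2}\big(\frac{1+K/2}{C}\big)^{1/2}$, where the first two inequalities are taken from p. 95 of van der Vaart and Wellner (1996), and the third inequality is due to Lemma 2.2.1 in van der Vaart and Wellner (1996). (8.1) then follows.

For every $e\ge 0$,

\begin{align*}
P\bigg(\Big|\frac{1}{T}\sum_{t=1}^T\Big(\prod_{l=1}^LZ_{l,i,t,j} - E\Big[\prod_{l=1}^LZ_{l,i,t,j}\Big]\Big)\Big| \ge e\bigg)
&\le P\Big(\max_{1\le t\le T}\Big|\prod_{l=1}^LZ_{l,i,t,j}\Big| \ge e - A\Big) \\
&\le \sum_{t=1}^TP\Big(\Big|\prod_{l=1}^LZ_{l,i,t,j}\Big| \ge e - A\Big) \le TKe^{-C(e-A)^{2/L}} \le TK'e^{-Ce^{2/L}},
\end{align*}

for $K' = Ke^{CA^{2/L}}$ and where the last inequality is due to subadditivity of the concave function: $(x+y)^{2/L} \le x^{2/L} + y^{2/L}$ for $x,y\ge 0$, $L\ge 3$. Let $X_{i,j}$ denote $\frac{1}{T}\sum_{t=1}^T\big(\prod_{l=1}^LZ_{l,i,t,j} - E\big[\prod_{l=1}^LZ_{l,i,t,j}\big]\big)$. Consider some positive constant $D<C$.

\begin{align*}
E\big[e^{D|X_{i,j}|^{2/L}}\big]
&= \int\int_0^{|x|^{2/L}}De^{Ds}\,ds\,P(dx) + 1 = \int_0^\infty De^{Ds}P\big(|X_{i,j}| \ge s^{L/2}\big)\,ds + 1 \\
&\le \int_0^\infty TK'De^{(D-C)s}\,ds + 1 = \frac{TK'D}{C-D} + 1 \le BT, \tag{8.3}
\end{align*}

for some positive constant $B$, where the second equality is by Fubini's theorem. Then we can use independence across $i$ to invoke Proposition 2 in Appendix B with $\alpha = \frac{1}{L+1}$ and $C_1 = BT$: for $e \ge 1/\sqrt{N}$,

$$P\Big(\Big|\sum_{i=1}^NX_{i,j}\Big| \ge Ne\Big) \le A'BT\,e^{-K(e^2N)^{\frac{1}{L+1}}}$$

for positive constants $A'$ and $K$. Setting $e = \sqrt{M(\log(pT))^{L+1}/N}$ for some $M>0$, we have

$$P\bigg(\max_{1\le j\le p}\Big|\frac{1}{N}\sum_{i=1}^NX_{i,j}\Big| \ge e\bigg) \le \sum_{j=1}^pP\Big(\Big|\sum_{i=1}^NX_{i,j}\Big| \ge Ne\Big) \le A'BpT\,e^{-KM^{\frac{1}{L+1}}\log(pT)} = A'B(pT)^{1-KM^{\frac{1}{L+1}}}.$$

The upper bound of the preceding probability becomes arbitrarily small for $N$ and $M$ sufficiently large. Hence (8.2) follows. □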


Lemma 9. Let $A$ be a symmetric $p\times p$ matrix, and $\hat v$ and $v\in\mathbb{R}^p$. Then

$$|\hat v'A\hat v - v'Av| \le \|A\|_\infty\|\hat v - v\|_1^2 + 2\|Av\|\,\|\hat v - v\|.$$

Proof. See Lemma 6.1 in the working-paper version of van de Geer et al. (2014).

□
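The inequality in Lemma 9 follows from the identity $\hat v'A\hat v - v'Av = (\hat v-v)'A(\hat v-v) + 2v'A(\hat v-v)$ for symmetric $A$. The sketch below checks the resulting bound numerically on random inputs. It is an illustration only: the function names and the way the test matrices are drawn are assumptions made for this example.

```python
import math
import random


def mat_vec(mat, vec):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(row[j] * vec[j] for j in range(len(vec))) for row in mat]


def quad_form(mat, vec):
    """Return vec' * mat * vec."""
    return sum(x * y for x, y in zip(vec, mat_vec(mat, vec)))


def lemma9_sides(dim, seed):
    """Return (lhs, rhs) of |v_hat'A v_hat - v'Av| <= ||A||_inf ||d||_1^2 + 2 ||Av|| ||d||,
    where d = v_hat - v and ||A||_inf is the largest absolute entry of A."""
    rng = random.Random(seed)
    # Symmetric A built from a random upper triangle.
    a = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(i, dim):
            a[i][j] = a[j][i] = rng.uniform(-1.0, 1.0)
    v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    v_hat = [x + rng.uniform(-0.2, 0.2) for x in v]
    diff = [x - y for x, y in zip(v_hat, v)]
    lhs = abs(quad_form(a, v_hat) - quad_form(a, v))
    sup_a = max(abs(x) for row in a for x in row)
    l1_diff = sum(abs(x) for x in diff)
    l2_diff = math.sqrt(sum(x * x for x in diff))
    l2_av = math.sqrt(sum(x * x for x in mat_vec(a, v)))
    rhs = sup_a * l1_diff ** 2 + 2.0 * l2_av * l2_diff
    return lhs, rhs


if __name__ == "__main__":
    for seed in range(5):
        lhs, rhs = lemma9_sides(8, seed)
        print(f"seed={seed}: lhs={lhs:.4f} <= rhs={rhs:.4f}")
```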


Theorem 5 (McLeish (1974)). Let $\{X_{n,i}, i = 1,\dots,k_n\}$ be a martingale difference array with respect to the triangular array of $\sigma$-algebras $\{\mathcal{F}_{n,i}, i = 0,\dots,k_n\}$ (i.e., $X_{n,i}$ is $\mathcal{F}_{n,i}$-measurable and $E[X_{n,i}|\mathcal{F}_{n,i-1}] = 0$ almost surely for all $n$ and $i$) satisfying $\mathcal{F}_{n,i-1}\subseteq\mathcal{F}_{n,i}$ for all $n\ge 1$. Assume,

(i) $\max_{i\le k_n}|X_{n,i}|$ is uniformly bounded in $L_2$ norm,

(ii) $\max_{i\le k_n}|X_{n,i}| \xrightarrow{p} 0$, and

(iii) $\sum_{i=1}^{k_n}X_{n,i}^2 \xrightarrow{p} 1$.

Then, $S_n = \sum_{i=1}^{k_n}X_{n,i} \xrightarrow{d} N(0,1)$ as $n\to\infty$.
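To make the conditions of Theorem 5 concrete, the following simulation sketch builds a martingale difference array with conditionally heteroskedastic increments and checks that the normalised sum behaves like a standard normal. The data-generating process (an ARCH-type recursion with conditional variance $0.5 + 0.5e_{i-1}^2$), the function names and the simulation sizes are all assumptions chosen for this illustration and do not come from the paper.

```python
import math
import random


def normalized_martingale_sum(n, rng):
    """Return (S_n, sum of squares) for X_{n,i} = sigma_i * e_i / sqrt(n).

    sigma_i^2 = 0.5 + 0.5 * e_{i-1}^2 depends only on the past, so
    E[X_{n,i} | past] = 0 while the conditional variance is not constant.
    Since E[sigma_i^2] = 1, the sum of squared increments concentrates around 1.
    """
    prev = rng.gauss(0.0, 1.0)  # e_0, used only to start the variance recursion
    total = 0.0
    sum_sq = 0.0
    for _ in range(n):
        sigma2 = 0.5 + 0.5 * prev * prev
        e = rng.gauss(0.0, 1.0)
        x = math.sqrt(sigma2 / n) * e
        total += x
        sum_sq += x * x
        prev = e
    return total, sum_sq


def simulate(reps=2000, n=200, seed=0):
    """Return (mean of S_n, variance of S_n, average of sum_i X_{n,i}^2) over reps draws."""
    rng = random.Random(seed)
    draws = [normalized_martingale_sum(n, rng) for _ in range(reps)]
    sums = [d[0] for d in draws]
    mean = sum(sums) / reps
    var = sum((s - mean) ** 2 for s in sums) / (reps - 1)
    avg_sum_sq = sum(d[1] for d in draws) / reps
    return mean, var, avg_sum_sq


if __name__ == "__main__":
    print(simulate())
```

In this design the sample mean of $S_n$ should be close to 0 and both the sample variance and the average of $\sum_iX_{n,i}^2$ close to 1, in line with condition (iii) and the conclusion of the theorem.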